Hardware faults on the regular 2-D computing array of a typical deep learning accelerator (DLA) can cause a dramatic loss of prediction accuracy. Prior redundancy designs typically dedicate each homogeneous redundant processing element (PE) to mitigating faulty PEs within a limited region of the 2-D computing array, rather than across the entire array, to avoid excessive hardware overhead. However, they fail to recover the computing array when the number of faulty PEs in any region exceeds the number of redundant PEs in that region. This mismatch problem worsens when the fault injection rate rises and the faults are unevenly distributed. To address the problem, we propose a hybrid computing architecture (HyCA) for fault-tolerant DLAs. It has a set of dot-production processing units (DPPUs) that recompute all the operations mapped to the faulty PEs, regardless of where the faulty PEs are located. According to our experiments, HyCA shows significantly higher reliability, scalability, and performance with less chip area penalty than conventional redundancy approaches. Moreover, by taking advantage of this flexible recomputing, HyCA can also be used to scan the entire 2-D computing array and detect faulty PEs effectively at runtime.
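The regional mismatch described above can be illustrated with a small sketch. It is not the HyCA implementation: it assumes a hypothetical row-redundancy baseline with one spare PE per row, and it models location-independent recomputing as a single shared pool whose size (`pool_capacity`) is an illustrative parameter, not a figure from the paper. The function names and the fault coordinates are likewise made up for the example.

```python
from collections import Counter


def row_redundancy_recovers(faulty_pes, spares_per_row=1):
    """Region-bound baseline (assumed): each row owns its own spare PEs.

    Recovery fails as soon as any single row holds more faulty PEs
    than that row has spares, no matter how many spares sit idle elsewhere.
    """
    per_row = Counter(row for row, _col in set(faulty_pes))
    return all(count <= spares_per_row for count in per_row.values())


def shared_pool_recovers(faulty_pes, pool_capacity):
    """Location-independent model (assumed): one pool covers any faulty PE.

    Only the total number of faulty PEs matters, not where they are.
    """
    return len(set(faulty_pes)) <= pool_capacity


# Hypothetical fault maps on an 8x8 array, given as (row, col) pairs.
clustered_faults = [(0, 1), (0, 5), (0, 6)]  # three faults in the same row
spread_faults = [(0, 1), (3, 5), (7, 6)]     # three faults in different rows

for name, faults in [("clustered", clustered_faults), ("spread", spread_faults)]:
    print(
        name,
        "row-redundancy:", row_redundancy_recovers(faults),
        "shared pool:", shared_pool_recovers(faults, pool_capacity=8),
    )
```

With the same total budget of eight spares, the row-bound scheme cannot recover the clustered fault map because row 0 needs three spares but owns only one, while the shared pool handles both maps. This is the uneven-distribution case the abstract identifies as the weakness of region-limited redundancy.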