With the growing adoption of neural network models in safety-critical applications such as autonomous driving and robotics, reliability has become a critical metric alongside performance and energy efficiency. Algorithm-based fault tolerance (ABFT) strategies, implemented atop standard chip architectures, are cost-effective and portable across hardware platforms, making them particularly attractive. However, traditional ABFT relies on exact error checks, triggering error recovery procedures even for minor computational deviations. Such minor errors often do not impact the accuracy of the neural network model, owing to the inherent fault tolerance of neural networks. To address this inefficiency, we propose an approximate ABFT approach, called ApproxABFT, which initiates error recovery only when computational errors are significant. This approach avoids unnecessary recovery procedures, streamlines the error recovery process, and focuses on correcting impactful errors, ultimately improving recovery quality. Additionally, ApproxABFT incorporates a fine-grained blocking strategy that smooths error sensitivity across layers within neural network models. Experimental results demonstrate that, compared to classical exact ABFT, ApproxABFT reduces computing overhead by 67.83\% and improves the tolerable bit error rate by an order of magnitude on average.
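The core idea lends itself to a short illustration. Below is a minimal sketch of a checksum-based ABFT check for a matrix multiplication, where recovery is triggered only when the checksum deviation exceeds a threshold; the function name `abft_check`, the `threshold` parameter, and the fault-injection demo are illustrative assumptions, not taken from the paper, which may use different error metrics and recovery granularity.

```python
import numpy as np

def abft_check(A, B, C, threshold):
    """Compare column checksums of C against checksums derived from A and B.

    Classical ABFT flags any nonzero mismatch; this approximate variant
    flags a column only when its checksum deviation exceeds `threshold`.
    Returns the indices of columns that need recovery.
    """
    expected = A.sum(axis=0) @ B            # e^T (A B): expected column sums of C
    deviation = np.abs(C.sum(axis=0) - expected)
    return np.nonzero(deviation > threshold)[0]

# Fault-injection demo: a large error triggers recovery, a tiny one is tolerated.
rng = np.random.default_rng(0)
A = rng.standard_normal((64, 48)).astype(np.float32)
B = rng.standard_normal((48, 32)).astype(np.float32)
C = A @ B
C[5, 7] += 10.0    # significant fault -> checksum deviation exceeds threshold
C[9, 20] += 1e-4   # negligible fault  -> below threshold, no recovery triggered
print(abft_check(A, B, C, threshold=1e-1))  # -> [7]
```

In this sketch, only the column whose deviation exceeds the threshold would be handed to the recovery procedure; the negligible fault is absorbed, which is the behavior the abstract describes for errors that do not affect model accuracy.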