Vision Transformers (ViTs), with their outstanding performance, have become a popular backbone of deep learning models for mainstream vision tasks including classification, object detection, and segmentation. Beyond performance, reliability is also a critical metric for adopting ViTs in safety-critical applications such as autonomous driving and robotics. Observing that the major computing blocks in ViTs, such as multi-head attention and feed-forward layers, are usually performed with general matrix multiplication (GEMM), we apply a classical algorithm-based fault tolerance (ABFT) strategy, originally developed for GEMM, to protect ViTs against soft errors in the underlying computing engines. Unlike classical ABFT, which invokes the expensive error recovery procedure whenever computing errors are detected, we leverage the inherent fault tolerance of ViTs and propose an approximate ABFT, namely ApproxABFT, that invokes the error recovery procedure only when the computing errors are significant enough. This skips many unnecessary error recovery procedures and simplifies the overall GEMM error recovery. ApproxABFT also relaxes the error threshold in the error recovery procedure and ignores minor computing errors, which reduces the error recovery complexity and improves the error recovery quality. In addition, we apply a fine-grained blocking strategy to ApproxABFT and split GEMMs of distinct sizes into smaller sub-blocks, which smooths the error thresholds across ViTs and further improves the error recovery quality. According to our experiments, ApproxABFT reduces the computing overhead by 25.92\% to 81.62\% and improves the model accuracy by 2.63\% to 72.56\% compared to the baseline ABFT, while the blocking optimization further reduces the computing overhead by 6.56\% to 73.5\% with comparable accuracy.
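The thresholded recovery idea can be illustrated with a minimal sketch of checksum-based ABFT for a single GEMM. This is not the implementation evaluated in the paper. The function names (`matmul`, `checksum_deviation`, `approx_abft`) are hypothetical, the significance metric (largest absolute checksum deviation) is an assumption, and the correction step assumes at most one faulty element in the output.

```python
def matmul(a, b):
    """Plain GEMM on nested lists: C = A @ B."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def checksum_deviation(a, b, c):
    """Compare the row/column sums of C with the sums predicted from A and B.

    Returns (row_dev, col_dev): row_dev[i] is the error in the sum of row i
    of C, and col_dev[j] is the error in the sum of column j of C.
    """
    b_row_sums = [sum(row) for row in b]          # B @ 1
    a_col_sums = [sum(col) for col in zip(*a)]    # 1^T @ A
    expected_rows = [sum(x * s for x, s in zip(row, b_row_sums)) for row in a]
    expected_cols = [sum(s * y for s, y in zip(a_col_sums, col)) for col in zip(*b)]
    row_dev = [sum(row) - e for row, e in zip(c, expected_rows)]
    col_dev = [sum(col) - e for col, e in zip(zip(*c), expected_cols)]
    return row_dev, col_dev


def approx_abft(a, b, c, threshold):
    """Repair C in place only when the checksum deviation exceeds `threshold`.

    Returns "clean", "skipped", or "corrected". The threshold metric and the
    single-error assumption are illustrative choices, not taken from the paper.
    """
    row_dev, col_dev = checksum_deviation(a, b, c)
    magnitude = max(abs(d) for d in row_dev + col_dev)
    if magnitude == 0:
        return "clean"
    if magnitude <= threshold:
        return "skipped"  # minor error: tolerate it and skip recovery
    # Single-error assumption: the faulty element sits at the intersection
    # of the most-deviating row and the most-deviating column.
    i = max(range(len(row_dev)), key=lambda r: abs(row_dev[r]))
    j = max(range(len(col_dev)), key=lambda s: abs(col_dev[s]))
    c[i][j] -= row_dev[i]
    return "corrected"


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]

c_minor = matmul(a, b)
c_minor[0][1] += 1            # small injected error, below the threshold
status_minor = approx_abft(a, b, c_minor, threshold=2)

c_major = matmul(a, b)
c_major[1][0] += 100          # large injected error, above the threshold
status_major = approx_abft(a, b, c_major, threshold=2)
```

With `threshold=0` the sketch behaves like classical ABFT and repairs every detected deviation. Raising the threshold trades exactness for fewer recovery invocations, which is the trade-off the abstract describes.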