Vision Transformers (ViTs), with their outstanding performance, have become a popular backbone of deep learning models for mainstream vision tasks including classification, object detection, and segmentation. Beyond performance, reliability is also a critical metric for the adoption of ViTs in safety-critical applications such as autonomous driving and robotics. Observing that the major computing blocks in ViTs, such as multi-head attention and feed-forward layers, are typically performed with general matrix multiplication (GEMM), we propose to adopt a classical algorithm-based fault tolerance (ABFT) strategy, originally developed for GEMM, to protect ViTs against soft errors in the underlying computing engines. Unlike classical ABFT, which invokes the expensive error recovery procedure whenever computing errors are detected, we leverage the inherent fault tolerance of ViTs and propose an approximate ABFT, namely ApproxABFT, that invokes the error recovery procedure only when the computing errors are significant enough. This skips many useless error recovery procedures and simplifies the overall GEMM error recovery. According to our experiments, ApproxABFT reduces the computing overhead by 25.92% to 81.62% and improves the model accuracy by 2.63% to 72.56% compared to the baseline ABFT.
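The checksum scheme underlying ABFT for GEMM, and the error-significance threshold that distinguishes ApproxABFT from the classical scheme, can be sketched as follows. This is an illustrative NumPy sketch under assumed names (`abft_encode_gemm`, `abft_check`) and an assumed fixed threshold; it is not the paper's actual implementation.

```python
import numpy as np

def abft_encode_gemm(A, B):
    """Checksummed GEMM: augment A with a column-checksum row and B with a
    row-checksum column, then multiply. The (m+1) x (n+1) result carries
    the column sums of C = A @ B in its last row and the row sums of C in
    its last column."""
    A_c = np.vstack([A, A.sum(axis=0, keepdims=True)])
    B_r = np.hstack([B, B.sum(axis=1, keepdims=True)])
    return A_c @ B_r

def abft_check(C_full, threshold=1e-3):
    """ApproxABFT-style check: recompute the checksums from the result
    block and compare with the carried ones. Return True (i.e. trigger
    error recovery) only when the largest checksum mismatch exceeds the
    significance threshold; small deviations are tolerated."""
    C = C_full[:-1, :-1]
    col_err = np.abs(C.sum(axis=0) - C_full[-1, :-1]).max()
    row_err = np.abs(C.sum(axis=1) - C_full[:-1, -1]).max()
    return max(col_err, row_err) > threshold
```

A soft error flipped into the result block perturbs one row sum and one column sum, so `abft_check` reports it; under ApproxABFT semantics, a mismatch below the threshold is deemed insignificant and the recovery procedure is skipped.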