Vision Transformers (ViTs) have achieved state-of-the-art performance on various vision tasks. However, ViTs' self-attention module is still arguably a major bottleneck, limiting their achievable hardware efficiency. Meanwhile, existing accelerators dedicated to NLP Transformers are not optimal for ViTs. This is because there is a large difference between ViTs and NLP Transformers: ViTs have a relatively fixed number of input tokens, whose attention maps can be pruned by up to 90% even with fixed sparse patterns, while NLP Transformers need to handle input sequences of varying numbers of tokens and rely on on-the-fly predictions of dynamic sparse attention patterns for each input to achieve a decent sparsity (e.g., >=50%). To this end, we propose a dedicated algorithm and accelerator co-design framework dubbed ViTCoD for accelerating ViTs. Specifically, on the algorithm level, ViTCoD prunes and polarizes the attention maps to have either denser or sparser fixed patterns for regularizing two levels of workloads without hurting the accuracy, largely reducing the attention computations while leaving room for alleviating the remaining dominant data movements; on top of that, we further integrate a lightweight and learnable auto-encoder module to enable trading the dominant high-cost data movements for lower-cost computations. On the hardware level, we develop a dedicated accelerator to simultaneously coordinate the enforced denser/sparser workloads and the encoder/decoder engines for boosted hardware utilization. Extensive experiments and ablation studies validate that ViTCoD largely reduces the dominant data movement costs, achieving speedups of up to 235.3x, 142.9x, 86.0x, 10.1x, and 6.8x over general computing platforms (CPUs, EdgeGPUs, and GPUs) and the prior-art Transformer accelerators SpAtten and Sanger, respectively, under an attention sparsity of 90%.
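To make the algorithm-level idea more concrete, below is a minimal NumPy sketch of pruning and polarizing a calibration attention map into a fixed denser/sparser pattern at roughly 90% sparsity. The column-importance scoring, the local diagonal band, and the column reordering are simplifying assumptions for illustration only, not ViTCoD's exact procedure.

```python
# Minimal, illustrative sketch (NumPy) of polarizing attention into a fixed
# denser/sparser mask; the scoring and reordering below are assumptions,
# not ViTCoD's exact algorithm.
import numpy as np

def polarize_attention_mask(attn_maps, dense_ratio=0.1, keep_local=1):
    """Build a fixed binary mask from calibration attention maps.

    attn_maps:   (num_samples, N, N) softmax attention, averaged over heads.
    dense_ratio: fraction of key tokens whose columns are kept entirely
                 ("denser" region); the rest keep only a local diagonal
                 band ("sparser" region).
    Returns (mask, order): an (N, N) 0/1 mask and a column order that groups
    the dense columns together to regularize the two levels of workloads.
    """
    N = attn_maps.shape[-1]
    # Score each key token by the attention it receives on average.
    col_score = attn_maps.mean(axis=(0, 1))          # (N,)
    num_dense = max(1, int(dense_ratio * N))
    dense_cols = np.argsort(col_score)[-num_dense:]  # most-attended tokens

    mask = np.zeros((N, N), dtype=np.uint8)
    mask[:, dense_cols] = 1                          # denser: full columns
    for offset in range(-keep_local, keep_local + 1):
        idx = np.arange(max(0, -offset), min(N, N - offset))
        mask[idx, idx + offset] = 1                  # sparser: local band only

    # Reorder columns so the dense block is contiguous for the hardware.
    order = np.concatenate([np.sort(dense_cols),
                            np.setdiff1d(np.arange(N), dense_cols)])
    return mask, order

# Example: random "calibration" attention for 196 tokens (14x14 patches).
rng = np.random.default_rng(0)
logits = rng.normal(size=(8, 196, 196))
attn = np.exp(logits) / np.exp(logits).sum(-1, keepdims=True)
mask, order = polarize_attention_mask(attn)
print("kept fraction:", mask.mean())   # roughly 0.1, i.e., ~90% sparsity
```

In the actual framework the mask is fixed after pruning, so both the denser and the sparser workloads are known ahead of time and can be mapped onto dedicated engines.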
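The data-movement trade-off can likewise be sketched in a few lines: a small learnable auto-encoder compresses Q/K before they are moved, and a decoder restores them next to the attention compute. The PyTorch module below is a hypothetical stand-in (the linear layers, compression ratio, and reconstruction loss are assumptions, not ViTCoD's exact module or training recipe), meant only to show how moving the compressed representation cuts traffic at the cost of a few extra multiply-accumulates.

```python
# Hypothetical sketch of trading high-cost data movement for low-cost compute
# with a lightweight auto-encoder on Q/K; shapes and loss are assumptions.
import torch
import torch.nn as nn

class QKAutoEncoder(nn.Module):
    def __init__(self, dim: int, ratio: int = 4):
        super().__init__()
        self.encoder = nn.Linear(dim, dim // ratio, bias=False)  # compress
        self.decoder = nn.Linear(dim // ratio, dim, bias=False)  # restore

    def forward(self, x):
        z = self.encoder(x)        # compact representation that gets moved
        x_hat = self.decoder(z)    # decoded back beside the compute engine
        return z, x_hat

dim, tokens = 64, 196
ae = QKAutoEncoder(dim, ratio=4)
q = torch.randn(1, tokens, dim)

z, q_hat = ae(q)
recon = nn.functional.mse_loss(q_hat, q)   # could be trained jointly with the ViT
moved_before = q.numel()                   # elements moved without compression
moved_after = z.numel()                    # elements moved with compression
print(f"traffic reduced {moved_before / moved_after:.1f}x, recon MSE {recon:.4f}")
```

With a 4x compression ratio, the bytes transferred for Q/K shrink by roughly 4x, while the decoder adds only a small linear layer's worth of computation per token.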