Vision Transformers (ViT) have demonstrated competitive performance advantages over convolutional neural networks (CNNs), though they often come with high computational costs. To this end, previous methods explore different structured attention patterns that restrict each token to attending a fixed number of spatially nearby tokens in order to accelerate the ViT's multi-head self-attention (MHSA) operations. However, such structured attention patterns limit token-to-token connections to their spatial relevance, disregarding the learned semantic connections captured by a full attention mask. In this work, we propose a novel approach to learn instance-dependent attention patterns by devising a lightweight connectivity predictor module that estimates the connectivity score of each pair of tokens. Intuitively, two tokens have a high connectivity score if their features are relevant either spatially or semantically. Because each token attends to only a small number of other tokens, the binarized connectivity masks are very sparse by nature and thus provide an opportunity to accelerate the network via sparse computations. Equipped with the learned unstructured attention pattern, our sparse-attention ViT (Sparsifiner) achieves a superior Pareto-optimal trade-off between FLOPs and top-1 accuracy on ImageNet compared to token sparsity. Our method reduces MHSA FLOPs by 48% to 69% with an accuracy drop within 0.4%. We also show that combining attention and token sparsity reduces ViT FLOPs by more than 60%.
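To make the idea concrete, the following is a minimal PyTorch-style sketch of an instance-dependent connectivity predictor followed by masked attention; it is not the authors' implementation, and the module name, low-rank projection size, and top-k binarization are assumptions made for illustration. The mask here is applied densely for clarity; an actual speedup would require sparse attention kernels.

```python
import torch
import torch.nn as nn

class ConnectivityPredictor(nn.Module):
    """Hypothetical lightweight module: scores token-pair connectivity and
    keeps only the top-k connections per query token (sparse binary mask)."""
    def __init__(self, dim, rank=32, topk=16):
        super().__init__()
        self.q_proj = nn.Linear(dim, rank)  # low-rank projections keep the predictor cheap
        self.k_proj = nn.Linear(dim, rank)
        self.topk = topk

    def forward(self, x):
        # x: (B, N, dim) token features
        q = self.q_proj(x)                             # (B, N, rank)
        k = self.k_proj(x)                             # (B, N, rank)
        scores = q @ k.transpose(-2, -1)               # (B, N, N) connectivity scores
        # Binarize: each token attends only to its top-k highest-scoring tokens,
        # so the resulting mask is sparse by construction.
        idx = scores.topk(self.topk, dim=-1).indices   # (B, N, topk)
        mask = torch.zeros_like(scores).scatter_(-1, idx, 1.0)
        return mask                                    # instance-dependent attention mask

def masked_attention(q, k, v, mask):
    # Standard scaled dot-product attention, with non-connected pairs
    # masked out before the softmax.
    attn = (q @ k.transpose(-2, -1)) / q.shape[-1] ** 0.5
    attn = attn.masked_fill(mask == 0, float("-inf")).softmax(dim=-1)
    return attn @ v
```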