Attention-based transformer networks have demonstrated promising potential as their applications extend from natural language processing to vision. However, despite recent improvements such as sub-quadratic attention approximations and various training enhancements, compact vision transformers to date that use regular attention still fall short of their convnet counterparts in terms of \textit{accuracy}, \textit{model size}, and \textit{throughput}. This paper introduces a compact self-attention mechanism that is fundamental and highly generalizable. The proposed method reduces redundancy and improves efficiency on top of existing attention optimizations. We show its drop-in applicability to both the regular attention mechanism and some of the most recent attention variants in vision transformers. As a result, we produce smaller and faster models with the same or better accuracy.
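For reference, the \textit{regular attention} baseline referred to above is typically the standard scaled dot-product self-attention; a minimal sketch of that formulation, with query, key, and value matrices $Q$, $K$, $V$ and key dimension $d_k$ (symbols assumed here for illustration rather than taken from this paper), is
\[
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V,
\]
where the $QK^{\top}$ product is what gives regular attention its quadratic cost in sequence length and is the target of the sub-quadratic approximations mentioned above.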