Several recent studies have demonstrated that attention-based networks, such as the Vision Transformer (ViT), can outperform Convolutional Neural Networks (CNNs) on several computer vision tasks without using convolutional layers. This naturally leads to the following question: can a self-attention layer of ViT express any convolution operation? In this work, we prove constructively that a single ViT layer with image patches as input can perform any convolution operation, where the multi-head attention mechanism and the relative positional encoding play essential roles. We further provide a lower bound on the number of heads required for Vision Transformers to express CNNs. Consistent with our analysis, experimental results show that the construction in our proof can help inject convolutional bias into Transformers and significantly improve the performance of ViT in low-data regimes.
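To illustrate the flavor of such a construction (the precise statement and proof are in the body of the paper), the following is a minimal NumPy sketch. It makes simplifying assumptions: each grid position is treated as one token, the grid size and channel counts are illustrative, and the attention weights are taken to be the hard one-hot shift patterns that relative positional encodings of large magnitude induce in the softmax limit. Under these assumptions, 9 heads, one per relative offset, reproduce a 3x3 convolution exactly.

```python
import numpy as np

# Illustrative sizes (assumptions, not from the paper).
H = W = 6            # token grid: one token per spatial position
C_in, C_out = 4, 8   # input / output channels
x = np.random.randn(H * W, C_in)             # tokens, flattened row-major
kernel = np.random.randn(3, 3, C_in, C_out)  # the conv we want to express

# One head per relative offset of the 3x3 kernel.
offsets = [(dy, dx) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]

def shift_attention(offset):
    """One-hot attention matrix: token (i, j) attends to (i+dy, j+dx).
    Relative positional encoding with large magnitude drives softmax
    attention toward exactly this hard pattern."""
    A = np.zeros((H * W, H * W))
    dy, dx = offset
    for i in range(H):
        for j in range(W):
            ii, jj = i + dy, j + dx
            if 0 <= ii < H and 0 <= jj < W:  # zero-padding at the borders
                A[i * W + j, ii * W + jj] = 1.0
    return A

# Multi-head attention with identity value projections: head h gathers the
# tokens shifted by its offset; the per-head output projection applies the
# corresponding slice of the convolution kernel, and heads are summed.
out_attn = np.zeros((H * W, C_out))
for h, (dy, dx) in enumerate(offsets):
    gathered = shift_attention((dy, dx)) @ x   # (HW, C_in)
    W_h = kernel[dy + 1, dx + 1]               # (C_in, C_out)
    out_attn += gathered @ W_h

# Reference: direct 3x3 convolution (stride 1, zero padding).
img = x.reshape(H, W, C_in)
pad = np.pad(img, ((1, 1), (1, 1), (0, 0)))
out_conv = np.zeros((H, W, C_out))
for i in range(H):
    for j in range(W):
        patch = pad[i:i + 3, j:j + 3]          # (3, 3, C_in)
        out_conv[i, j] = np.einsum('klc,klcd->d', patch, kernel)

assert np.allclose(out_attn.reshape(H, W, C_out), out_conv)
```

The sketch also hints at why the head count matters for the lower bound: each head can realize only one shift pattern, so a K x K kernel consumes on the order of K^2 heads in this construction.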