Transformers have shown great potential in various computer vision tasks owing to their strong capability in modeling long-range dependencies using the self-attention mechanism. Nevertheless, vision transformers treat an image as a 1D sequence of visual tokens, lacking an intrinsic inductive bias (IB) in modeling local visual structures and dealing with scale variance. Instead, they require large-scale training data and longer training schedules to learn the IB implicitly. In this paper, we propose a novel Vision Transformer Advanced by Exploring intrinsic IB from convolutions, \ie, ViTAE. Technically, ViTAE has several spatial pyramid reduction modules that downsample and embed the input image into tokens with rich multi-scale context by using multiple convolutions with different dilation rates. In this way, it acquires an intrinsic scale-invariance IB and is able to learn robust feature representations for objects at various scales. Moreover, in each transformer layer, ViTAE has a convolution block in parallel to the multi-head self-attention module, whose features are fused and fed into the feed-forward network. Consequently, it has the intrinsic locality IB and is able to learn local features and global dependencies collaboratively. Experiments on ImageNet as well as downstream tasks demonstrate the superiority of ViTAE over the baseline transformer and concurrent works. Source code and pretrained models will be available on GitHub.
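The two architectural ideas above can be made concrete with a minimal sketch. The class names (ReductionCell, ParallelConvAttnLayer), the specific dilation rates, the depthwise-conv branch, and the fusion-by-addition scheme are illustrative assumptions rather than the released implementation; the sketch only shows how parallel dilated convolutions yield multi-scale tokens and how a convolution branch runs alongside multi-head self-attention before the feed-forward network.

```python
import torch
import torch.nn as nn

class ReductionCell(nn.Module):
    """Sketch of a spatial pyramid reduction module: parallel dilated
    convolutions downsample and embed the input into tokens with
    multi-scale context (scale-invariance IB)."""
    def __init__(self, in_ch=3, embed_dim=64, dilations=(1, 2, 3, 4), stride=4):
        super().__init__()
        self.branches = nn.ModuleList([
            nn.Conv2d(in_ch, embed_dim // len(dilations), kernel_size=3,
                      stride=stride, padding=d, dilation=d)
            for d in dilations
        ])

    def forward(self, x):                       # x: (B, C, H, W)
        feats = [b(x) for b in self.branches]   # each: (B, D/k, H/s, W/s)
        x = torch.cat(feats, dim=1)             # fuse multi-scale context
        return x.flatten(2).transpose(1, 2)     # tokens: (B, N, D)

class ParallelConvAttnLayer(nn.Module):
    """Sketch of a transformer layer with a convolution block in parallel
    to multi-head self-attention; the fused features feed the FFN."""
    def __init__(self, dim=64, num_heads=4, mlp_ratio=4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.conv = nn.Sequential(               # local branch (locality IB)
            nn.Conv2d(dim, dim, 3, padding=1, groups=dim),
            nn.Conv2d(dim, dim, 1),
        )
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, dim * mlp_ratio), nn.GELU(),
            nn.Linear(dim * mlp_ratio, dim),
        )

    def forward(self, x, hw):                    # x: (B, N, D), hw: (H, W)
        B, N, D = x.shape
        h, w = hw
        y = self.norm1(x)
        attn_out, _ = self.attn(y, y, y)         # global dependencies
        conv_in = y.transpose(1, 2).reshape(B, D, h, w)
        conv_out = self.conv(conv_in).flatten(2).transpose(1, 2)  # local features
        x = x + attn_out + conv_out              # fuse the two parallel branches
        return x + self.mlp(self.norm2(x))       # feed-forward network

# Usage sketch: a 224x224 image is reduced to 56x56 tokens, then processed
# by one layer that learns local and global features collaboratively.
img = torch.randn(1, 3, 224, 224)
tokens = ReductionCell()(img)                    # (1, 3136, 64)
out = ParallelConvAttnLayer()(tokens, (56, 56))  # (1, 3136, 64)
```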