Vision Transformer (ViT) extends the applicability of transformers from language processing to computer vision tasks, serving as an alternative architecture to the existing convolutional neural networks (CNNs). Since the transformer-based architecture is a recent innovation in computer vision modeling, the design conventions for an effective architecture have been less studied. Drawing on the successful design principles of CNNs, we investigate the role of spatial dimension conversion and its effectiveness in transformer-based architectures. We particularly attend to the dimension-reduction principle of CNNs: as the depth increases, a conventional CNN increases the channel dimension and decreases the spatial dimensions. We empirically show that such a spatial dimension reduction is beneficial to a transformer architecture as well, and propose a novel Pooling-based Vision Transformer (PiT) built upon the original ViT model. We show that PiT achieves improved model capability and generalization performance compared with ViT. Through extensive experiments, we further show that PiT outperforms the baseline on several tasks such as image classification, object detection, and robustness evaluation. Source code and ImageNet models are available at https://github.com/naver-ai/pit
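As a concrete illustration of the spatial dimension reduction described above, the following PyTorch sketch pools a ViT-style token sequence between stages: patch tokens are reshaped back to a 2D grid, downsampled with a strided depthwise convolution that widens the channels, and the class token (which carries no spatial position) is resized with a separate linear projection. This is a minimal sketch under stated assumptions; the module name `TokenPooling` and all hyperparameters are illustrative, and the authors' actual implementation lives in the linked repository.

```python
import torch
import torch.nn as nn


class TokenPooling(nn.Module):
    """Halve the spatial token grid while widening channels, mirroring the
    CNN-style dimension reduction described in the abstract. Illustrative
    sketch only, not the authors' released implementation."""

    def __init__(self, in_dim: int, out_dim: int, stride: int = 2):
        super().__init__()
        # Strided depthwise convolution downsamples H and W; with
        # groups=in_dim, out_dim must be a multiple of in_dim
        # (e.g. doubled channels at each stage transition).
        self.conv = nn.Conv2d(
            in_dim, out_dim,
            kernel_size=stride + 1, stride=stride, padding=stride // 2,
            groups=in_dim,
        )
        # The class token is projected to the new width separately,
        # since it has no spatial location to pool over.
        self.cls_proj = nn.Linear(in_dim, out_dim)

    def forward(self, tokens, hw):
        # tokens: (B, 1 + H*W, in_dim) with the class token first; hw = (H, W)
        h, w = hw
        cls_tok, patch_tok = tokens[:, :1], tokens[:, 1:]
        # Restore the 2D layout so the tokens can be pooled spatially.
        x = patch_tok.transpose(1, 2).reshape(-1, patch_tok.size(-1), h, w)
        x = self.conv(x)  # (B, out_dim, ceil(H/2), ceil(W/2))
        new_hw = (x.size(2), x.size(3))
        patch_tok = x.flatten(2).transpose(1, 2)
        return torch.cat([self.cls_proj(cls_tok), patch_tok], dim=1), new_hw


# Example: 14x14 patch tokens plus a class token, pooled to 7x7 with 2x channels.
pool = TokenPooling(in_dim=64, out_dim=128)
tokens = torch.randn(2, 1 + 14 * 14, 64)
out, hw = pool(tokens, (14, 14))
print(out.shape, hw)  # torch.Size([2, 50, 128]) (7, 7)
```

A plain average pooling over the 2D grid would also reduce the spatial dimensions; a strided (depthwise) convolution is used here so the same operation can simultaneously widen the channel dimension, matching the CNN principle the abstract refers to.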