The recently developed vision transformer (ViT) has achieved promising results on image classification compared to convolutional neural networks. Inspired by this, in this paper we study how to learn multi-scale feature representations in transformer models for image classification. To this end, we propose a dual-branch transformer that combines image patches (i.e., tokens in a transformer) of different sizes to produce stronger image features. Our approach processes small-patch and large-patch tokens with two separate branches of different computational complexity, and these tokens are then fused purely by attention multiple times to complement each other. Furthermore, to reduce computation, we develop a simple yet effective token fusion module based on cross-attention, which uses a single token from each branch as a query to exchange information with the other branch. The proposed cross-attention requires only linear, rather than quadratic, computation and memory in the number of tokens. Extensive experiments demonstrate that our approach performs better than or on par with several concurrent vision transformer works, in addition to efficient CNN models. For example, on the ImageNet1K dataset, with some architectural changes, our approach outperforms the recent DeiT by a large margin of 2\% with a small to moderate increase in FLOPs and model parameters. Our source code and models are available at \url{https://github.com/IBM/CrossViT}.
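To make the cross-attention fusion concrete, the following is a minimal PyTorch sketch of the idea described above: the single CLS token of one branch acts as the query and attends to all patch tokens of the other branch, so attention cost grows linearly with the number of tokens. The class name, dimensions, and the assumption that both branches have already been projected to a common embedding dimension are illustrative, not the released implementation.

```python
import torch
import torch.nn as nn


class CrossAttentionFusionSketch(nn.Module):
    """Illustrative cross-attention fusion (hypothetical sketch, not the official module):
    one branch's CLS token queries the other branch's patch tokens."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.num_heads = num_heads
        self.scale = (dim // num_heads) ** -0.5
        self.wq = nn.Linear(dim, dim)    # query from the single CLS token
        self.wk = nn.Linear(dim, dim)    # keys from the other branch's tokens
        self.wv = nn.Linear(dim, dim)    # values from the other branch's tokens
        self.proj = nn.Linear(dim, dim)

    def forward(self, cls_token: torch.Tensor, other_tokens: torch.Tensor) -> torch.Tensor:
        # cls_token: (B, 1, C) from one branch; other_tokens: (B, N, C) from the other branch.
        B, N, C = other_tokens.shape
        H = self.num_heads
        q = self.wq(cls_token).reshape(B, 1, H, C // H).transpose(1, 2)     # (B, H, 1, C/H)
        k = self.wk(other_tokens).reshape(B, N, H, C // H).transpose(1, 2)  # (B, H, N, C/H)
        v = self.wv(other_tokens).reshape(B, N, H, C // H).transpose(1, 2)  # (B, H, N, C/H)
        # A single query token against N keys: O(N) attention instead of O(N^2).
        attn = (q @ k.transpose(-2, -1)) * self.scale                       # (B, H, 1, N)
        attn = attn.softmax(dim=-1)
        fused = (attn @ v).transpose(1, 2).reshape(B, 1, C)                 # (B, 1, C)
        return self.proj(fused)


if __name__ == "__main__":
    # Fuse the large-patch branch's CLS token with the small-patch branch's tokens.
    fuse = CrossAttentionFusionSketch(dim=384)
    cls_large = torch.randn(2, 1, 384)       # CLS token of the large-patch branch
    tokens_small = torch.randn(2, 197, 384)  # tokens of the small-patch branch
    print(fuse(cls_large, tokens_small).shape)  # torch.Size([2, 1, 384])
```

In the paper's dual-branch design this exchange is applied symmetrically (each branch's CLS token queries the other branch) and repeated several times; the sketch shows only a single direction of the exchange.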