In this paper, we present Co-scale conv-attentional image Transformers (CoaT), a Transformer-based image classifier equipped with co-scale and conv-attentional mechanisms. First, the co-scale mechanism maintains the integrity of Transformers' encoder branches at individual scales, while allowing representations learned at different scales to effectively communicate with each other; we design a series of serial and parallel blocks to realize the co-scale mechanism. Second, we devise a conv-attentional mechanism by realizing a relative position embedding formulation in the factorized attention module with an efficient convolution-like implementation. CoaT empowers image Transformers with enriched multi-scale and contextual modeling capabilities. On ImageNet, relatively small CoaT models attain superior classification results compared with similar-sized convolutional neural networks and image/vision Transformers. The effectiveness of CoaT's backbone is also illustrated on object detection and instance segmentation, demonstrating its applicability to downstream computer vision tasks.
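To make the conv-attentional mechanism concrete, the following is a minimal PyTorch sketch of a factorized-attention layer combined with a depthwise-convolution relative position term, in the spirit of the module described above. All class and parameter names (e.g., `ConvAttention`, `rel_pos_conv`) are illustrative assumptions rather than the authors' exact implementation, and the class token handled in the full model is omitted for brevity.

```python
import torch
import torch.nn as nn


class ConvAttention(nn.Module):
    """Sketch of a conv-attentional layer (hypothetical names): factorized
    attention computes softmax(K)^T V before multiplying by Q, and a depthwise
    convolution over V acts as a convolution-like relative position embedding."""

    def __init__(self, dim, num_heads=8, kernel_size=3):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5
        self.qkv = nn.Linear(dim, dim * 3)
        self.proj = nn.Linear(dim, dim)
        # Depthwise conv over the value map realizes the relative position term.
        self.rel_pos_conv = nn.Conv2d(
            dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)

    def forward(self, x, H, W):
        # x: (B, N, C) flattened tokens with N == H * W.
        B, N, C = x.shape
        qkv = self.qkv(x).reshape(B, N, 3, self.num_heads, self.head_dim)
        q, k, v = qkv.permute(2, 0, 3, 1, 4)  # each: (B, heads, N, head_dim)

        # Factorized attention: softmax over K's token axis, then K^T V first,
        # so the cost is linear (not quadratic) in the number of tokens N.
        k = k.softmax(dim=2)
        context = k.transpose(-2, -1) @ v        # (B, heads, head_dim, head_dim)
        factor_att = self.scale * (q @ context)  # (B, heads, N, head_dim)

        # Conv relative position embedding: depthwise conv applied to V
        # (reshaped to a 2D feature map), gated elementwise by Q.
        v_map = v.transpose(1, 2).reshape(B, N, C).transpose(1, 2).reshape(B, C, H, W)
        conv_v = self.rel_pos_conv(v_map).reshape(B, C, N).transpose(1, 2)
        conv_v = conv_v.reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)
        conv_att = q * conv_v                    # (B, heads, N, head_dim)

        out = (factor_att + conv_att).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```

Computing softmax(K)^T V before multiplying by Q is the key design choice: it reduces the attention cost from quadratic to linear in the token count, which is what makes attending over high-resolution feature maps across multiple scales practical.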