Vision transformer (ViT) has recently shown its strong capability in achieving results comparable to convolutional neural networks (CNNs) on image classification. However, vanilla ViT simply inherits its architecture directly from natural language processing, which is often not optimized for vision applications. Motivated by this, in this paper, we propose a new architecture that adopts a pyramid structure and employs novel regional-to-local attention rather than global self-attention in vision transformers. More specifically, our model first generates regional tokens and local tokens from an image with different patch sizes, where each regional token is associated with a set of local tokens based on the spatial location. The regional-to-local attention includes two steps: first, regional self-attention extracts global information among all regional tokens, and then local self-attention exchanges information among one regional token and its associated local tokens. Therefore, even though local self-attention confines its scope to a local region, it can still receive global information. Extensive experiments on three vision tasks, including image classification, object detection and action recognition, show that our approach outperforms or is on par with state-of-the-art ViT variants, including many concurrent works. Our source codes and models will be publicly available.
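To make the two-step mechanism concrete, the following is a minimal PyTorch sketch of regional-to-local attention, not the authors' reference implementation. The module name, parameter names, and the assumption that the image has already been split into R regions (each with one regional token and L local tokens of dimension C) are illustrative choices; it uses standard multi-head self-attention for both steps.

```python
import torch
import torch.nn as nn


class RegionalToLocalAttention(nn.Module):
    """Sketch of the two-step regional-to-local attention (hypothetical names).

    Step 1: self-attention over regional tokens only, exchanging global
            information across the whole image.
    Step 2: within each region, self-attention over the regional token plus
            its associated local tokens, so local tokens receive global
            context through the regional token.
    """

    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.regional_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.local_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, regional, local):
        # regional: (B, R, C) regional tokens; local: (B, R, L, C) local tokens
        B, R, L, C = local.shape

        # Step 1: regional tokens attend to each other globally.
        regional, _ = self.regional_attn(regional, regional, regional)

        # Step 2: concatenate each regional token with its local tokens and
        # run self-attention inside that small window only.
        tokens = torch.cat([regional.unsqueeze(2), local], dim=2)  # (B, R, 1+L, C)
        tokens = tokens.reshape(B * R, 1 + L, C)
        tokens, _ = self.local_attn(tokens, tokens, tokens)
        tokens = tokens.reshape(B, R, 1 + L, C)

        # Split back into updated regional and local tokens.
        return tokens[:, :, 0], tokens[:, :, 1:]


# Usage sketch: a 4x4 grid of regions, 49 local tokens (7x7 patches) each.
if __name__ == "__main__":
    attn = RegionalToLocalAttention(dim=96)
    reg = torch.randn(2, 16, 96)
    loc = torch.randn(2, 16, 49, 96)
    reg_out, loc_out = attn(reg, loc)
    print(reg_out.shape, loc_out.shape)  # (2, 16, 96) (2, 16, 49, 96)
```

Because the second step operates on only 1 + L tokens per region, its cost grows linearly with the number of regions rather than quadratically with the total number of patches, while the first step keeps the token exchange global.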