Transformers have recently gained significant attention in the computer vision community. However, the lack of scalability of self-attention mechanisms with respect to image size has limited their wide adoption in state-of-the-art vision backbones. In this paper, we introduce an efficient and scalable attention model we call multi-axis attention, which consists of two aspects: blocked local attention and dilated global attention. These design choices allow global-local spatial interactions on arbitrary input resolutions with only linear complexity. We also present a new architectural element by effectively blending our proposed attention model with convolutions, and accordingly propose a simple hierarchical vision backbone, dubbed MaxViT, that simply repeats the basic building block over multiple stages. Notably, MaxViT is able to "see" globally throughout the entire network, even in the earlier, high-resolution stages. We demonstrate the effectiveness of our model on a broad spectrum of vision tasks. On image classification, MaxViT achieves state-of-the-art performance under various settings: without extra data, MaxViT attains 86.5% ImageNet-1K top-1 accuracy; with ImageNet-21K pre-training, our model achieves 88.7% top-1 accuracy. For downstream tasks, MaxViT as a backbone delivers favorable performance on object detection as well as visual aesthetic assessment. We also show that our proposed model expresses strong generative modeling capability on ImageNet, demonstrating the superior potential of MaxViT blocks as a universal vision module. The source code and trained models will be available at https://github.com/google-research/maxvit.
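To make the two attention aspects concrete, below is a minimal NumPy sketch of the partitioning schemes behind multi-axis attention: block partition groups tokens into non-overlapping local windows, while grid partition groups tokens on a fixed uniform grid dilated across the whole feature map. This is not the official implementation; the function names, window/grid size `p`/`g`, and toy tensor sizes are illustrative assumptions. Because attention is computed within fixed-size groups, the total cost grows linearly with the number of pixels.

```python
# Minimal sketch (assumed, not the authors' code) of the two token
# partitioning schemes underlying multi-axis attention.
import numpy as np

def block_partition(x, p):
    # (H, W, C) -> (num_windows, p*p, C): non-overlapping p x p local windows
    h, w, c = x.shape
    x = x.reshape(h // p, p, w // p, p, c)
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, p * p, c)

def grid_partition(x, g):
    # (H, W, C) -> (num_windows, g*g, C): each group holds g*g tokens
    # sampled on a uniform grid with stride H/g, i.e. dilated globally
    h, w, c = x.shape
    x = x.reshape(g, h // g, g, w // g, c)
    return x.transpose(1, 3, 0, 2, 4).reshape(-1, g * g, c)

def attention(tokens):
    # plain softmax self-attention inside each group; cost is
    # O(num_windows * window_len^2), i.e. linear in H*W for fixed p, g
    q = k = v = tokens
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(tokens.shape[-1])
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)
    return weights @ v

x = np.random.randn(16, 16, 8)              # toy feature map (H, W, C)
local_out = attention(block_partition(x, 4))   # blocked local attention
global_out = attention(grid_partition(x, 4))   # dilated global attention
print(local_out.shape, global_out.shape)       # both (16, 16, 8)
```

Applying the two operations back to back, as a MaxViT block does, lets every token exchange information both with its local neighborhood and with tokens across the entire image, at any input resolution.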