We present Neighborhood Attention Transformer (NAT), an efficient, accurate, and scalable hierarchical transformer that performs well on both image classification and downstream vision tasks. It is built upon Neighborhood Attention (NA), a simple and flexible attention mechanism that localizes each query's receptive field to its nearest neighboring pixels. NA is a localized form of self-attention, and approaches it as the receptive field size grows. Given the same receptive field size, NA matches Swin Transformer's shifted-window attention in FLOPs and memory usage, while being less constrained. Furthermore, NA comes with local inductive biases, which eliminate the need for extra operations such as pixel shifts. NAT achieves competitive experimental results: NAT-Tiny reaches 83.2% top-1 accuracy on ImageNet with only 4.3 GFLOPs and 28M parameters, 51.4% mAP on MS-COCO, and 48.4% mIoU on ADE20K. We open-sourced our checkpoints, code, and CUDA kernel at: https://github.com/SHI-Labs/Neighborhood-Attention-Transformer.
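To make the mechanism concrete, below is a minimal NumPy sketch of single-head neighborhood attention over a 2D feature map. It is a naive reference for illustration only, not the released CUDA kernel: the function name `neighborhood_attention_2d` is our own, the relative positional biases used in the paper are omitted, and the clamped-window border handling reflects our reading of the paper's description (the window shifts inward at borders so every query attends to exactly k×k keys).

```python
import numpy as np

def neighborhood_attention_2d(x, wq, wk, wv, k=7):
    """Naive single-head Neighborhood Attention (NA) over a 2D feature map.

    x          : (H, W, C) input features, with H, W >= k
    wq, wk, wv : (C, C) query/key/value projection matrices
    k          : odd neighborhood size; every query attends to exactly
                 k*k keys, matching a k*k windowed attention in FLOPs.
    Relative positional biases from the paper are omitted for brevity.
    """
    H, W, C = x.shape
    q, key, v = x @ wq, x @ wk, x @ wv
    r, scale = k // 2, C ** -0.5
    out = np.zeros_like(q)
    for i in range(H):
        # Clamp the window at the borders: it shifts inward rather than
        # shrinking, so corner queries still see their k*k nearest pixels.
        i0 = min(max(i - r, 0), H - k)
        for j in range(W):
            j0 = min(max(j - r, 0), W - k)
            nk = key[i0:i0 + k, j0:j0 + k].reshape(k * k, C)
            nv = v[i0:i0 + k, j0:j0 + k].reshape(k * k, C)
            logits = nk @ q[i, j] * scale        # (k*k,) attention logits
            attn = np.exp(logits - logits.max()) # stable softmax
            attn /= attn.sum()
            out[i, j] = attn @ nv                # weighted sum of values
    return out

# Tiny smoke test: a 14x14 map with 32 channels and a 7x7 neighborhood.
rng = np.random.default_rng(0)
x = rng.standard_normal((14, 14, 32))
w = [rng.standard_normal((32, 32)) * 0.1 for _ in range(3)]
print(neighborhood_attention_2d(x, *w).shape)    # (14, 14, 32)
```

Note that when k equals the full feature-map size, every query attends to all pixels and the sketch reduces to plain global self-attention, which is the limiting behavior the abstract describes.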