Very recently, a variety of vision transformer architectures for dense prediction tasks have been proposed, and they show that the design of spatial attention is critical to their success in these tasks. In this work, we revisit the design of the spatial attention and demonstrate that a carefully devised yet simple spatial attention mechanism performs favourably against state-of-the-art schemes. As a result, we propose two vision transformer architectures, namely Twins-PCPVT and Twins-SVT. Our proposed architectures are highly efficient and easy to implement, involving only matrix multiplications that are highly optimized in modern deep learning frameworks. More importantly, the proposed architectures achieve excellent performance on a wide range of visual tasks, including image-level classification as well as dense detection and segmentation. The simplicity and strong performance suggest that our proposed architectures may serve as stronger backbones for many vision tasks. Our code is released at https://github.com/Meituan-AutoML/Twins.
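For concreteness, below is a minimal PyTorch sketch of the spatially separable self-attention used in Twins-SVT, which alternates locally-grouped self-attention (LSA) within non-overlapping windows with global sub-sampled attention (GSA), whose keys and values come from a sub-sampled feature map. This is an illustrative sketch, not the released implementation: the module names and hyper-parameters (`window`, `sr_ratio`, head counts) are assumptions, the input resolution is assumed divisible by both, and residual connections, layer norms, MLPs, and the conditional positional encoding are omitted.

```python
import torch
import torch.nn as nn

class LSA(nn.Module):
    """Locally-grouped self-attention: standard multi-head attention
    restricted to non-overlapping w x w windows (sketch)."""
    def __init__(self, dim, heads=8, window=7):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                      # x: (B, H, W, C); H, W divisible by window
        B, H, W, C = x.shape
        w = self.window
        # Partition the feature map into (H//w * W//w) windows of w*w tokens.
        x = x.view(B, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5)
        x = x.reshape(-1, w * w, C)            # (B * num_windows, w*w, C)
        x, _ = self.attn(x, x, x)              # attention inside each window only
        # Reverse the window partition back to a (B, H, W, C) map.
        x = x.view(B, H // w, W // w, w, w, C).permute(0, 1, 3, 2, 4, 5)
        return x.reshape(B, H, W, C)

class GSA(nn.Module):
    """Global sub-sampled attention: every query attends to keys/values
    taken from a strided-convolution summary of the whole map (sketch)."""
    def __init__(self, dim, heads=8, sr_ratio=8):
        super().__init__()
        # Strided conv sub-samples the key/value map by sr_ratio per side.
        self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, x):                      # x: (B, H, W, C); H, W divisible by sr_ratio
        B, H, W, C = x.shape
        q = x.reshape(B, H * W, C)             # all spatial positions as queries
        kv = self.sr(x.permute(0, 3, 1, 2))    # (B, C, H/sr, W/sr)
        kv = kv.flatten(2).transpose(1, 2)     # (B, H*W/sr^2, C)
        out, _ = self.attn(q, kv, kv)          # global context at sub-sampled cost
        return out.view(B, H, W, C)

# Usage: a 56x56 map with 64 channels (divisible by window=7 and sr_ratio=8).
x = torch.randn(2, 56, 56, 64)
y = GSA(64)(LSA(64)(x))                        # LSA then GSA, the two halves of one block
print(y.shape)                                 # torch.Size([2, 56, 56, 64])
```

LSA keeps the attention cost linear in image size by confining it to fixed windows, while GSA restores global interactions through a cheap sub-sampled summary; both sub-layers reduce to dense matrix multiplications, which is the efficiency property the abstract refers to.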