Vision Transformers have achieved outstanding performance on many computer vision tasks. Early Vision Transformers such as ViT and DeiT adopt global self-attention, which is computationally expensive when the number of patches is large. To improve efficiency, recent Vision Transformers adopt local self-attention mechanisms, in which self-attention is computed within local windows. Although window-based local self-attention significantly boosts efficiency, it fails to capture relationships between distant but similar patches in the image plane. To overcome this limitation of image-space local attention, in this paper we further exploit the locality of patches in the feature space: we group patches into multiple clusters according to their features and compute self-attention within each cluster. Such feature-space local attention effectively captures connections between patches that lie in different local windows but are still relevant to each other. We propose a Bilateral lOcal Attention vision Transformer (BOAT), which integrates feature-space local attention with image-space local attention. We further integrate BOAT with both Swin and CSWin models, and extensive experiments on several benchmark datasets demonstrate that our BOAT-CSWin model clearly and consistently outperforms existing state-of-the-art CNN models and vision Transformers.
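The core idea of feature-space local attention can be illustrated with a minimal sketch: partition patches into balanced clusters by feature similarity, then run scaled dot-product self-attention within each cluster. The sketch below is hypothetical and greatly simplified, not the paper's actual method; it uses a random-projection sort as a stand-in for balanced feature clustering, and shares the query/key/value projections for brevity.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def feature_space_local_attention(x, num_clusters=4):
    """Toy sketch of feature-space local attention.

    x: (n, d) patch features; n must be divisible by num_clusters.
    Patches are grouped into balanced clusters by feature similarity
    (here approximated by sorting along a random projection, a stand-in
    for the balanced clustering used in the actual model), and
    self-attention is computed independently within each cluster.
    """
    n, d = x.shape
    rng = np.random.default_rng(0)
    proj = x @ rng.standard_normal(d)        # 1-D projection of features
    order = np.argsort(proj)                 # similar patches become neighbors
    out = np.empty_like(x)
    size = n // num_clusters
    for c in range(num_clusters):
        idx = order[c * size:(c + 1) * size]
        q = k = v = x[idx]                   # shared Q/K/V projections for brevity
        attn = softmax(q @ k.T / np.sqrt(d)) # scaled dot-product attention
        out[idx] = attn @ v                  # attend only within this cluster
    return out
```

Because clusters are formed in feature space rather than on the image grid, two distant but similar patches can attend to each other, which is exactly what window-based image-space attention misses.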