Recently, self-attention mechanisms have shown impressive performance in various NLP and CV tasks, as they help capture sequential features and derive global information. In this work, we explore how to extend self-attention modules to better learn subtle feature embeddings for recognizing fine-grained objects, e.g., different bird species or person identities. To this end, we propose a dual cross-attention learning (DCAL) algorithm to coordinate with self-attention learning. First, we propose global-local cross-attention (GLCA) to enhance the interactions between global images and local high-response regions, which helps reinforce spatially discriminative clues for recognition. Second, we propose pair-wise cross-attention (PWCA) to establish interactions between image pairs. PWCA regularizes the attention learning of an image by treating another image as a distractor, and is removed during inference. We observe that DCAL can reduce misleading attention and diffuse the attention response to discover more complementary parts for recognition. We conduct extensive evaluations on fine-grained visual categorization and object re-identification. Experiments demonstrate that DCAL performs on par with state-of-the-art methods and consistently improves multiple self-attention baselines, e.g., surpassing DeiT-Tiny and ViT-Base by 2.8% and 2.4% mAP on MSMT17, respectively.
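To make the two cross-attention variants concrete, the sketch below illustrates them in plain numpy under simplifying assumptions: single-head attention, 2-D token matrices, and local-query selection via the CLS row of the attention map (the function names `glca` and `pwca`, the ratio parameter `r`, and this selection heuristic are illustrative; the paper's actual method uses multi-head attention and attention rollout). It is a minimal sketch of the idea, not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Standard scaled dot-product attention.
    d = Q.shape[-1]
    return softmax(Q @ K.T / np.sqrt(d)) @ V

def glca(Q, K, V, r=0.1):
    # Global-local cross-attention (illustrative): pick the top-R
    # high-response local queries (here scored by the CLS query's
    # attention; the paper accumulates attention via rollout) and
    # attend them over the full global key-value set.
    d = Q.shape[-1]
    attn = softmax(Q @ K.T / np.sqrt(d))
    cls_response = attn[0]                        # CLS attention over all tokens
    k = max(1, int(r * (Q.shape[0] - 1)))
    top = np.argsort(cls_response[1:])[-k:] + 1   # top-R patch tokens, skip CLS
    return attention(Q[top], K, V)                # local queries x global K, V

def pwca(Q1, K1, V1, K2, V2):
    # Pair-wise cross-attention (illustrative): queries of image 1
    # attend over the concatenated keys/values of both images, so the
    # second image acts as a distractor. Used only during training.
    return attention(Q1, np.concatenate([K1, K2]), np.concatenate([V1, V2]))

# Shape check with random tokens (e.g., ViT: CLS + 196 patches, dim 64).
rng = np.random.default_rng(0)
t1 = rng.standard_normal((197, 64))
t2 = rng.standard_normal((197, 64))
print(glca(t1, t1, t1, r=0.1).shape)   # (19, 64): embeddings of local queries
print(pwca(t1, t1, t1, t2, t2).shape)  # (197, 64): same shape as self-attention
```

Because `pwca` only changes what the queries attend over, its output keeps the shape of ordinary self-attention, which is why the branch can be dropped at inference with no architectural change.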