Attention mechanisms have been widely applied to cross-modal tasks such as image captioning and information retrieval, and have achieved remarkable improvements due to their capability to learn fine-grained relevance across modalities. However, existing attention models can be sub-optimal and lack precision because no direct supervision is involved during training. In this work, we propose Contrastive Content Re-sourcing (CCR) and Contrastive Content Swapping (CCS) constraints to address this limitation. These constraints supervise the training of attention models in a contrastive learning manner without requiring explicit attention annotations. Additionally, we introduce three metrics, namely Attention Precision, Recall, and F1-Score, to quantitatively evaluate attention quality. We evaluate the proposed constraints on the cross-modal retrieval (image-text matching) task. Experiments on both the Flickr30k and MS-COCO datasets demonstrate that integrating these attention constraints into two state-of-the-art attention-based models improves performance in terms of both retrieval accuracy and attention metrics.
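To make the attention metrics concrete, below is a minimal illustrative sketch of how Attention Precision, Recall, and F1-Score could be computed. It assumes, hypothetically, that a region counts as "attended" when its attention weight exceeds a threshold, and that ground-truth relevant regions are given as a set of indices; the paper's exact definitions may differ.

```python
# Hypothetical sketch of attention-quality metrics; not the paper's exact
# formulation. Assumes thresholded attention weights and annotated regions.
import numpy as np

def attention_prf(attn_weights, relevant, threshold=0.1):
    """Precision, recall, and F1 of an attention distribution.

    attn_weights: 1-D array of attention weights over image regions
                  (assumed to sum to 1, e.g. a softmax output).
    relevant:     set of region indices annotated as relevant to the query.
    threshold:    weight above which a region is treated as attended
                  (a hypothetical choice for this sketch).
    """
    attended = {i for i, w in enumerate(attn_weights) if w > threshold}
    if not attended or not relevant:
        return 0.0, 0.0, 0.0
    tp = len(attended & relevant)          # correctly attended regions
    precision = tp / len(attended)
    recall = tp / len(relevant)
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1

# Example: six regions; the model attends mostly to regions 1 and 2,
# but only region 1 is annotated as relevant.
weights = np.array([0.05, 0.45, 0.35, 0.05, 0.05, 0.05])
print(attention_prf(weights, relevant={1}))  # -> (0.5, 1.0, ~0.667)
```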