Cross-modal attention mechanisms have been widely applied to the image-text matching task and have achieved remarkable improvements thanks to their capability of learning fine-grained relevance across different modalities. However, the cross-modal attention models of existing methods can be sub-optimal and inaccurate because no direct supervision is provided during training. In this work, we propose two novel training strategies, namely the Contrastive Content Re-sourcing (CCR) and Contrastive Content Swapping (CCS) constraints, to address this limitation. These constraints supervise the training of cross-modal attention models in a contrastive learning manner without requiring explicit attention annotations. They are plug-in training strategies and can be easily integrated into existing cross-modal attention models. Additionally, we introduce three metrics, namely Attention Precision, Attention Recall, and Attention F1-Score, to quantitatively measure the quality of learned attention models. We evaluate the proposed constraints by incorporating them into four state-of-the-art cross-modal attention-based image-text matching models. Experimental results on both the Flickr30k and MS-COCO datasets demonstrate that integrating these constraints improves both retrieval performance and the quality of the learned attention, as measured by the proposed metrics.
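To make the attention metrics concrete, the snippet below gives a minimal sketch of how Attention Precision, Recall, and F1-Score could be computed for a single query word over a set of image regions. The thresholding rule (a region counts as "attended" when its weight exceeds the uniform-attention level 1/N), the binary `relevant` ground-truth mask, and the function name `attention_prf` are illustrative assumptions for this sketch, not the paper's official definitions.

```python
# Minimal sketch (assumed definitions) of attention Precision / Recall / F1
# for one query word attending over N image regions.
import numpy as np

def attention_prf(attn: np.ndarray, relevant: np.ndarray, eps: float = 1e-8):
    """Score an attention distribution against a ground-truth relevance mask.

    attn:     shape (N,), non-negative attention weights over N image regions.
    relevant: shape (N,), 1 for regions truly relevant to the query word, else 0.
    """
    n = attn.shape[0]
    # Assumption: a region is "attended" if its weight exceeds uniform attention.
    attended = (attn > 1.0 / n).astype(np.float32)

    tp = float((attended * relevant).sum())          # correctly attended regions
    precision = tp / (attended.sum() + eps)          # attended regions that are relevant
    recall = tp / (relevant.sum() + eps)             # relevant regions that are attended
    f1 = 2 * precision * recall / (precision + recall + eps)
    return precision, recall, f1

# Example: 5 regions; attention concentrates on regions 1 and 2,
# while ground truth marks regions 1, 2, and 4 as relevant.
attn = np.array([0.05, 0.45, 0.35, 0.05, 0.10])
relevant = np.array([0, 1, 1, 0, 1])
print(attention_prf(attn, relevant))  # precision=1.0, recall~0.67, f1~0.8
```

In practice the per-word scores would be averaged over all words and image-text pairs in the evaluation set; the exact aggregation used in the paper may differ from this sketch.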