Exploiting fine-grained correspondences and visual-semantic alignments has shown great potential in image-text matching. Generally, recent approaches first employ a cross-modal attention unit to capture latent region-word interactions, and then integrate all the alignments to obtain the final similarity. However, most of them adopt one-time forward association or aggregation strategies, often with complex architectures or additional information, while ignoring the regulation capability of network feedback. In this paper, we develop two simple yet effective regulators that efficiently encode the message output to automatically contextualize and aggregate cross-modal representations. Specifically, we propose (i) a Recurrent Correspondence Regulator (RCR), which progressively facilitates the cross-modal attention unit with adaptive attention factors to capture more flexible correspondences, and (ii) a Recurrent Aggregation Regulator (RAR), which repeatedly adjusts the aggregation weights to increasingly emphasize important alignments and dilute unimportant ones. Notably, RCR and RAR are plug-and-play: both can be incorporated into many frameworks based on cross-modal interaction to yield significant benefits, and their combination achieves further improvements. Extensive experiments on the MSCOCO and Flickr30K datasets validate that they bring impressive and consistent R@1 gains across multiple models, confirming the general effectiveness and generalization ability of the proposed methods. Code and pre-trained models are available at: https://github.com/Paranioar/RCAR.
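To make the two regulators concrete, below is a minimal PyTorch sketch of the recurrent feedback loops the abstract describes. This is an illustration under our own assumptions, not the authors' released implementation (see the GitHub link above for that): the class `RecurrentRegulators` and the sub-networks `factor_net` / `weight_net` are hypothetical names, and the specific update rules (a softplus temperature for RCR, feedback-conditioned softmax weights for RAR) are one plausible instantiation of "adaptive attention factors" and "recurrent aggregation weights".

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RecurrentRegulators(nn.Module):
    """Illustrative sketch of RCR + RAR for one image-text pair.

    All names and update rules here are assumptions, not the authors' API.
    RCR: re-runs cross-modal attention with an attention factor (temperature)
         refined from the previous round's attended message.
    RAR: re-estimates per-alignment aggregation weights from the current
         aggregate, emphasizing important alignments over rounds.
    """

    def __init__(self, dim, num_steps=3):
        super().__init__()
        self.num_steps = num_steps
        # RCR: predicts a positive adaptive attention factor per region
        self.factor_net = nn.Sequential(
            nn.Linear(2 * dim, dim), nn.Tanh(),
            nn.Linear(dim, 1), nn.Softplus())
        # RAR: scores each region-word alignment for aggregation
        self.weight_net = nn.Sequential(
            nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, 1))

    def forward(self, regions, words):
        # regions: (n_regions, dim); words: (n_words, dim)
        sim = regions @ words.t()                        # raw affinities
        factor = regions.new_ones(regions.size(0), 1)    # initial attention factor

        # --- RCR: recurrently regulated cross-modal attention ---
        for _ in range(self.num_steps):
            attn = F.softmax(factor * sim, dim=1)        # (n_regions, n_words)
            message = attn @ words                       # attended word context
            # feed the message back to refine the attention factor
            factor = self.factor_net(torch.cat([regions, message], dim=-1))

        # --- RAR: recurrently re-weighted aggregation of alignments ---
        align = regions * message                        # per-region alignment features
        weights = F.softmax(self.weight_net(align), dim=0)
        for _ in range(self.num_steps):
            pooled = (weights * align).sum(dim=0, keepdim=True)  # current aggregate
            # feed the aggregate back to sharpen the weights
            weights = F.softmax(self.weight_net(align + pooled), dim=0)

        per_region = F.cosine_similarity(regions, message, dim=-1).unsqueeze(-1)
        return (weights * per_region).sum()              # scalar similarity


# Hypothetical usage: 36 region features matched against 12 word features
model = RecurrentRegulators(dim=1024)
score = model(torch.randn(36, 1024), torch.randn(12, 1024))
```

Because both loops only re-enter an existing attention or pooling step with feedback-derived parameters, the same pattern can wrap any cross-modal interaction model, which is what makes the regulators plug-and-play in the sense described above.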