Cross-modal retrieval across image and text modalities is a challenging task due to its inherent ambiguity: an image often depicts multiple situations, and a caption can be matched with diverse images. Set-based embedding has been studied as a solution to this problem; it encodes a sample into a set of different embedding vectors, each capturing a different semantic of the sample. In this paper, we present a novel set-based embedding method that is distinct from previous work in two aspects. First, we present a new similarity function, called smooth-Chamfer similarity, which is designed to alleviate the side effects of existing similarity functions for set-based embedding. Second, we propose a novel set prediction module that uses the slot attention mechanism to produce a set of embedding vectors effectively capturing the diverse semantics of the input. Our method is evaluated on the COCO and Flickr30K datasets across different visual backbones, where it outperforms existing methods, including those that demand substantially larger computation at inference.
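The abstract names smooth-Chamfer similarity without defining it. As an illustrative sketch only, the following assumes a log-sum-exp ("smooth max") relaxation of Chamfer similarity over pairwise cosine similarities, averaged symmetrically over both sets, with a temperature parameter; the function name `smooth_chamfer_similarity` and the parameter `alpha` are placeholders introduced here, and the paper's exact formulation may differ in its details:

```python
import numpy as np

def smooth_chamfer_similarity(s1: np.ndarray, s2: np.ndarray, alpha: float = 16.0) -> float:
    """Sketch of a smoothed Chamfer-style similarity between two embedding sets.

    s1: (n1, d) array of unit-normalized embeddings for one sample.
    s2: (n2, d) array of unit-normalized embeddings for the other sample.
    alpha: temperature; as alpha grows, log-sum-exp approaches a hard max.
    """
    # Pairwise cosine similarities (rows are assumed unit-normalized).
    c = s1 @ s2.T  # shape (n1, n2)
    # Smooth max over the other set via log-sum-exp, in both directions.
    lse_over_s2 = np.log(np.exp(alpha * c).sum(axis=1))  # for each x in s1
    lse_over_s1 = np.log(np.exp(alpha * c).sum(axis=0))  # for each y in s2
    # Symmetric average of the two directions, rescaled by the temperature.
    return lse_over_s2.mean() / (2 * alpha) + lse_over_s1.mean() / (2 * alpha)

# Minimal usage example with random unit-normalized embedding sets.
rng = np.random.default_rng(0)
a = rng.normal(size=(4, 128)); a /= np.linalg.norm(a, axis=1, keepdims=True)
b = rng.normal(size=(6, 128)); b /= np.linalg.norm(b, axis=1, keepdims=True)
print(smooth_chamfer_similarity(a, b))
```

The appeal of such a relaxation is that as `alpha` increases it approaches the hard Chamfer matching (each embedding attends only to its nearest counterpart), while smaller values spread the match over the whole other set, so every embedding in a set receives a training signal, which is consistent with the abstract's stated goal of alleviating the side effects of existing set similarity functions.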