Cross-modal retrieval methods build a common representation space for samples from multiple modalities, typically the vision and language domains. For images and their captions, the multiplicity of correspondences makes the task particularly challenging: given an image (respectively, a caption), there are multiple captions (respectively, images) that make equal sense. In this paper, we argue that deterministic functions are not sufficiently powerful to capture such one-to-many correspondences. Instead, we propose Probabilistic Cross-Modal Embedding (PCME), where samples from the different modalities are represented as probability distributions in the common embedding space. Since common benchmarks such as COCO suffer from non-exhaustive annotations for cross-modal matches, we propose to additionally evaluate retrieval on the CUB dataset, a smaller yet clean dataset where all possible image-caption pairs are annotated. We extensively ablate PCME and demonstrate that it not only improves retrieval performance over its deterministic counterpart, but also provides uncertainty estimates that make the embeddings more interpretable.
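To make the core idea concrete, the following is a minimal PyTorch-style sketch of a probabilistic embedding: each image or caption feature is mapped to a Gaussian (a mean and a per-dimension variance) in the shared space, and a pair is scored by averaging a sigmoid of the distance between Monte Carlo samples drawn from the two distributions. The module names, the sigmoid-of-distance matching score, and all dimensions and hyperparameters below are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class ProbabilisticHead(nn.Module):
    """Maps a backbone feature to a Gaussian (mu, log-variance) in the joint embedding space.
    This is a hypothetical head for illustration, not the authors' architecture."""
    def __init__(self, feat_dim, embed_dim):
        super().__init__()
        self.fc_mu = nn.Linear(feat_dim, embed_dim)
        self.fc_logvar = nn.Linear(feat_dim, embed_dim)

    def forward(self, feat, n_samples=7):
        mu = self.fc_mu(feat)                    # (B, D) distribution mean
        logvar = self.fc_logvar(feat)            # (B, D) per-dimension uncertainty
        std = (0.5 * logvar).exp()
        eps = torch.randn(n_samples, *mu.shape)  # Monte Carlo noise
        samples = mu.unsqueeze(0) + eps * std.unsqueeze(0)  # (S, B, D) sampled embeddings
        return mu, logvar, samples

def match_probability(img_samples, txt_samples, a=1.0, b=0.0):
    """Soft match score between one image and one caption:
    average sigmoid of (negative) Euclidean distance over all sampled pairs.
    The scale a and offset b are assumed learnable in practice."""
    d = torch.cdist(img_samples, txt_samples)    # (S, S) pairwise distances between samples
    return torch.sigmoid(-a * d + b).mean()

# Toy usage with random features; dimensions are arbitrary for the sketch.
img_head, txt_head = ProbabilisticHead(512, 256), ProbabilisticHead(512, 256)
_, img_logvar, img_z = img_head(torch.randn(4, 512))
_, txt_logvar, txt_z = txt_head(torch.randn(4, 512))
p = match_probability(img_z[:, 0], txt_z[:, 0])  # match score for image 0 / caption 0
uncertainty = img_logvar.exp().mean(dim=1)       # larger value -> more ambiguous image
```

Because each sample is a distribution rather than a point, the learned variance doubles as an uncertainty estimate: ambiguous images or captions can spread their probability mass to cover several plausible matches, which is the behavior the abstract refers to as making the embeddings more interpretable.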