Probabilistic embeddings have proven useful for capturing polysemous word meanings, as well as ambiguity in image matching. In this paper, we study the advantages of probabilistic embeddings in a cross-modal setting (i.e., text and images), and propose a simple approach that replaces the standard point-vector embeddings in existing image-text matching models with learned parametric probability distributions. Our guiding hypothesis is that the uncertainty encoded in the probabilistic embeddings captures the cross-modal ambiguity in the input instances, and that it is by capturing this uncertainty that the probabilistic models perform better at downstream tasks, such as image-to-text or text-to-image retrieval. Through extensive experiments on standard and new benchmarks, we show a consistent advantage for probabilistic representations in cross-modal retrieval, and validate the ability of our embeddings to capture uncertainty.
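The core idea, replacing point embeddings with probability distributions and scoring image-text pairs by a match probability, can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes diagonal-Gaussian embeddings and a sigmoid-of-distance match score estimated by Monte Carlo sampling, and all names (`embed`, `match_prob`, the weight matrices) are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def embed(features, w_mu, w_logsig):
    """Hypothetical embedding heads: map a shared feature vector to a
    diagonal-Gaussian embedding (mean and log-sigma) instead of a single point."""
    return features @ w_mu, features @ w_logsig

def match_prob(mu_a, logsig_a, mu_b, logsig_b, n_samples=128, a=1.0, b=0.0):
    """Monte Carlo estimate of the match probability between two probabilistic
    embeddings: draw samples from each Gaussian, score sample pairs with
    sigmoid(-a * distance + b), and average the scores."""
    z_a = mu_a + np.exp(logsig_a) * rng.standard_normal((n_samples, mu_a.size))
    z_b = mu_b + np.exp(logsig_b) * rng.standard_normal((n_samples, mu_b.size))
    dist = np.linalg.norm(z_a - z_b, axis=1)
    return float(np.mean(1.0 / (1.0 + np.exp(a * dist - b))))
```

Under this toy score, two embeddings with coincident means receive a higher match probability than two with distant means, and larger predicted sigmas spread the samples out, which is how the embedding can express uncertainty about an ambiguous input.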