Learning semantically meaningful sentence embeddings is an open problem in natural language processing. In this work, we propose a sentence embedding learning approach that exploits both visual and textual information via a multimodal contrastive objective. Through experiments on a variety of semantic textual similarity tasks, we demonstrate that our approach consistently improves performance across datasets and pre-trained encoders. In particular, by combining a small amount of multimodal data with a large text-only corpus, we improve the state-of-the-art average Spearman's correlation by 1.7%. By analyzing the properties of the textual embedding space, we show that our model excels at aligning semantically similar sentences, providing an explanation for its improved performance.
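To make the objective concrete, the following is a minimal PyTorch sketch of one plausible form of such a multimodal contrastive loss: a text-only InfoNCE term over the large text corpus combined with a symmetric image-text InfoNCE term over the small multimodal subset. The projection heads (`text_proj`, `img_proj`), temperature `tau`, and weight `lam` are illustrative assumptions, not names or values from the paper.

```python
# Illustrative sketch only; hyperparameters and module names are hypothetical.
import torch
import torch.nn.functional as F

def info_nce(a: torch.Tensor, b: torch.Tensor, tau: float = 0.05) -> torch.Tensor:
    """InfoNCE loss: row i of `a` should match row i of `b` among in-batch negatives."""
    logits = F.normalize(a, dim=-1) @ F.normalize(b, dim=-1).T / tau
    targets = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, targets)

def multimodal_contrastive_loss(
    sent_emb: torch.Tensor,      # (B, d_t) sentence embeddings
    sent_emb_aug: torch.Tensor,  # (B, d_t) second view of the same sentences (e.g. dropout noise)
    img_emb: torch.Tensor,       # (B, d_v) embeddings of the paired images
    text_proj: torch.nn.Module,  # projects text into the shared multimodal space (assumed)
    img_proj: torch.nn.Module,   # projects images into the shared multimodal space (assumed)
    lam: float = 0.1,            # weight of the multimodal term (hypothetical value)
) -> torch.Tensor:
    # Text-only contrastive term, applicable to the large text-only corpus.
    l_text = info_nce(sent_emb, sent_emb_aug)
    # Multimodal term: align sentences with their paired images in a shared
    # space, symmetrized over both retrieval directions.
    zt, zv = text_proj(sent_emb), img_proj(img_emb)
    l_mm = 0.5 * (info_nce(zt, zv) + info_nce(zv, zt))
    return l_text + lam * l_mm
```

In this sketch, the text-only term can be computed for every batch while the multimodal term is added only for batches drawn from the image-caption subset, which matches the setting of mixing a small multimodal dataset with a large text-only corpus.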