This paper introduces a large-scale multimodal and multilingual dataset that aims to facilitate research on grounding words to images in their contextual usage in language. The dataset consists of images selected to unambiguously illustrate concepts expressed in sentences from movie subtitles. The dataset is a valuable resource because (i) the images are aligned to text fragments rather than whole sentences; (ii) multiple images are possible for a given text fragment and sentence; (iii) the sentences are free-form and resemble real-world language; and (iv) the parallel texts are multilingual. We set up a fill-in-the-blank game for human annotators to evaluate the quality of the automatic image selection process of our dataset. We demonstrate the utility of the dataset on two automatic tasks: (i) fill-in-the-blank and (ii) lexical translation. Results of the human evaluation and of the automatic models show that images can be a useful complement to the textual context. The dataset will benefit research on the visual grounding of words, especially in the context of free-form sentences, and can be obtained from https://doi.org/10.5281/zenodo.5034604 under a Creative Commons licence.