This paper introduces a large-scale multimodal and multilingual dataset designed to facilitate research on grounding words to images in their contextual usage. The dataset consists of images selected to unambiguously illustrate concepts expressed in sentences from movie subtitles. It is a valuable resource because (i) the images are aligned to text fragments rather than whole sentences; (ii) multiple images may correspond to a single text fragment and sentence; (iii) the sentences are free-form and resemble real-world language; (iv) the parallel texts are multilingual. We set up a fill-in-the-blank game for human annotators to evaluate the quality of the dataset's automatic image selection process. We demonstrate the utility of the dataset on two automatic tasks: (i) fill-in-the-blank and (ii) lexical translation. Results from both the human evaluation and the automatic models show that images can usefully complement the textual context. The dataset will benefit research on the visual grounding of words, especially in the context of free-form sentences.