Coreference resolution, a core task in natural language processing, aims to identify words and phrases that refer to the same entity in a text. In this paper, we propose a novel task: resolving coreferences in multimodal data, specifically long-form textual descriptions of visual scenes. Most existing image-text datasets either contain only short sentences without coreferent expressions or leave coreferences unannotated. To address this gap, we first introduce a new dataset, Flickr30k-Coref, in which coreference chains and the bounding box localization of these chains are annotated. We then propose a new technique that learns to identify coreference chains through weakly supervised grounding from image-text pairs, combined with a regularization based on prior linguistic knowledge. Our model yields large performance gains over prior work in both coreference resolution and weakly supervised grounding of long-form text descriptions.