Coreference resolution, a core task in natural language processing, aims to identify words and phrases that refer to the same entity in a text. In this paper, we extend this task to resolving coreferences in long-form narrations of visual scenes. First, we introduce a new dataset with annotated coreference chains and their bounding boxes, as most existing image-text datasets only contain short sentences without coreferring expressions or labeled chains. We then propose a new technique that learns to identify coreference chains under weak supervision, using only image-text pairs and a regularization based on prior linguistic knowledge. Our model yields large performance gains over several strong baselines in resolving coreferences. We also show that coreference resolution helps improve the grounding of narratives in images.