The visual dialog task requires an AI agent to interact with humans in multi-round dialogs grounded in a visual environment. Pronouns are a common linguistic phenomenon in dialogs, where they are used to make communication more efficient. Resolving pronouns (i.e., grounding each pronoun to the noun phrase it refers to) is therefore an essential step towards understanding dialogs. In this paper, we propose VD-PCR, a novel framework that improves Visual Dialog understanding with Pronoun Coreference Resolution in both implicit and explicit ways. First, to implicitly help models understand pronouns, we design novel methods for jointly training the pronoun coreference resolution and visual dialog tasks. Second, observing that the coreference relationship between pronouns and their referents indicates the relevance between dialog rounds, we propose to explicitly prune irrelevant history rounds from the input of visual dialog models. With pruned input, the models can focus on the relevant dialog history and ignore distractions from irrelevant rounds. With the proposed implicit and explicit methods, VD-PCR achieves state-of-the-art results on the VisDial dataset.
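The explicit pruning step can be illustrated with a minimal sketch. It is not the VD-PCR implementation: it assumes coreference clusters are already available as lists of (round index, mention text) pairs, and it keeps only the earlier rounds that share a cluster with the current question. The names `relevant_rounds` and `prune_history`, the data layout, and the choice to treat the caption as round 0 are all hypothetical.

```python
from typing import List, Set, Tuple

# Assumed layout: a mention is (round_index, surface_text), and a cluster
# groups mentions that refer to the same entity.
Mention = Tuple[int, str]
Cluster = List[Mention]


def relevant_rounds(clusters: List[Cluster], current_round: int) -> Set[int]:
    """Indices of earlier rounds sharing a coreference cluster with the current round."""
    keep: Set[int] = set()
    for cluster in clusters:
        rounds_in_cluster = {r for r, _ in cluster}
        if current_round in rounds_in_cluster:
            keep |= {r for r in rounds_in_cluster if r < current_round}
    return keep


def prune_history(history: List[str], clusters: List[Cluster],
                  current_round: int) -> List[str]:
    """Drop earlier rounds that are not linked to the current round by coreference."""
    keep = relevant_rounds(clusters, current_round)
    return [text for i, text in enumerate(history[:current_round]) if i in keep]


# Toy dialog (invented for illustration); round 0 stands in for the caption.
history = [
    "a man is riding a horse on a beach",
    "is the man wearing a hat? yes",
    "are there other people? no",
    "what color is it?",
]
# Hypothetical resolver output: "it" in round 3 refers to "a hat" in round 1.
clusters = [
    [(0, "a man"), (1, "the man")],
    [(1, "a hat"), (3, "it")],
]

pruned = prune_history(history, clusters, current_round=3)
print(pruned)
```

In this toy example only round 1 survives pruning for the final question, because it is the only earlier round that contains a mention in the same cluster as "it". A question with no resolved pronoun would keep no history under this sketch, which is a simplifying assumption rather than a claim about the paper's behavior.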