GuessWhat?! is a visual dialogue task between a guesser and an oracle. The guesser aims to locate an object in an image that the oracle has privately selected, by asking a sequence of Yes/No questions. Asking appropriate questions as the dialogue progresses is vital to a successful final guess, so the progress of the dialogue should be properly represented and tracked. Previous question-generation models pay little attention to representing and tracking dialogue states, and are therefore prone to asking low-quality questions, such as repeated ones. This paper proposes a question-generation method based on visual dialogue state tracking (VDST). A visual dialogue state is defined as the distribution over the objects in the image together with the representations of those objects. The object representations are updated as the distribution over objects changes. An attention mechanism based on differences between objects is used to decode each new question. The distribution over objects is updated by comparing the question-answer pair with the objects. Experimental results on the GuessWhat?! dataset show that our model significantly outperforms existing methods and achieves new state-of-the-art performance. Notably, our model also reduces the rate of repeated questions from more than 50% for previous state-of-the-art methods to 21.9%.
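The state-tracking loop described above can be illustrated with a minimal sketch. It is not the paper's formulation: the multiplicative belief update, the squared-distance difference score, and all function names (`update_belief`, `reweight_objects`, `difference_attention`) are assumptions chosen only to show how a distribution over objects, belief-weighted object representations, and an object-difference attention could fit together.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def update_belief(belief, match_scores):
    """Assumed update rule: scale each object's prior belief by
    exp(match score with the latest question-answer pair), then renormalize."""
    weighted = [b * math.exp(s) for b, s in zip(belief, match_scores)]
    total = sum(weighted)
    return [w / total for w in weighted]


def reweight_objects(features, belief):
    """Scale each object's feature vector by its current belief."""
    return [[b * x for x in feat] for feat, b in zip(features, belief)]


def difference_attention(weighted_features):
    """Assumed attention: score each object by its mean squared distance
    to every other object, then normalize the scores with softmax."""
    n = len(weighted_features)
    scores = []
    for i in range(n):
        dist = 0.0
        for j in range(n):
            if i != j:
                dist += sum(
                    (a - b) ** 2
                    for a, b in zip(weighted_features[i], weighted_features[j])
                )
        scores.append(dist / max(n - 1, 1))
    return softmax(scores)


# Hypothetical toy input: three objects with 2-d features and a uniform prior.
features = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
belief = [1.0 / 3.0] * 3

# Hypothetical match scores: the answer supports object 0 and contradicts the rest.
belief = update_belief(belief, [2.0, -1.0, -1.0])
attention = difference_attention(reweight_objects(features, belief))
```

After one round, `belief` concentrates on the first object and `attention` is a normalized weighting that a question decoder could condition on. In a learned model the match scores and features would come from trained encoders rather than hand-set numbers.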