A goal-oriented visual dialogue involves multi-turn interactions between two agents, a Questioner and an Oracle. In these interactions, the answer given by the Oracle is of great significance, as it provides the golden response to what the Questioner asks about. Based on the answer, the Questioner updates its belief about the target visual content and raises a further question; notably, different answers lead to different visual beliefs and future questions. However, existing methods typically encode answers indiscriminately together with the much longer questions, resulting in weak utilization of the answers. In this paper, we propose an Answer-Driven Visual State Estimator (ADVSE) to impose the effects of different answers on the visual states. First, we propose Answer-Driven Focusing Attention (ADFA) to capture the answer-driven effect on visual attention by sharpening the question-related attention and adjusting it through an answer-based logical operation at each turn. Then, based on the focusing attention, we obtain the visual state estimation through Conditional Visual Information Fusion (CVIF), where overall information and difference information are fused conditioned on the question-answer state. We evaluate the proposed ADVSE on both the question generator and the guesser tasks of the large-scale GuessWhat?! dataset and achieve state-of-the-art performance on both. The qualitative results indicate that ADVSE drives the agent to generate highly efficient questions and to obtain reliable visual attention during the question generation and guessing processes.
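For concreteness, the following is a minimal sketch of how the two components described above could be realized; it assumes a PyTorch setting, and the softmax temperature, the yes/no gating rule, and the scalar fusion gate are illustrative assumptions rather than the paper's exact formulation.

```python
# Minimal, hypothetical sketch of ADFA and CVIF (not the authors' exact equations).
import torch
import torch.nn.functional as F

def adfa(prev_attention, question_logits, answer_is_yes, tau=0.2):
    """Answer-Driven Focusing Attention (sketch).

    prev_attention:  (N,) visual attention over N regions from the last turn
    question_logits: (N,) relevance scores of each region to the new question
    answer_is_yes:   bool, the Oracle's answer to the new question
    tau:             temperature that sharpens the question-related attention
    """
    # Sharpen the question-related attention with a low-temperature softmax.
    focused = F.softmax(question_logits / tau, dim=-1)
    # Answer-based logical adjustment: keep the focused regions on "Yes",
    # suppress them (shift attention elsewhere) on "No".
    if answer_is_yes:
        attention = prev_attention * focused
    else:
        attention = prev_attention * (1.0 - focused)
    # Renormalize so the attention remains a distribution over regions.
    return attention / attention.sum().clamp_min(1e-8)

def cvif(overall_info, difference_info, qa_gate):
    """Conditional Visual Information Fusion (sketch).

    overall_info, difference_info: (D,) pooled visual feature vectors
    qa_gate: scalar in [0, 1] derived from the question-answer state
    """
    # Fuse overall and difference information conditioned on the QA state.
    return qa_gate * overall_info + (1.0 - qa_gate) * difference_info
```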