Reinforcement Learning has shown promise for encouraging AI agents to conduct meaningful Visual Dialogue (VD). In Reinforcement Learning, it is crucial to represent states and to assign rewards based on the state transitions that actions cause. However, the state representations in previous Visual Dialogue work use only textual information, and their transitions are implicit. In this paper, we propose Explicit Concerning States (ECS) to represent which visual contents are of concern at each round and which have been of concern over the course of the Visual Dialogue. ECS is modeled from multimodal information and is represented explicitly. Based on ECS, we formulate two intuitive and interpretable rewards that encourage Visual Dialogue agents to converse about diverse and informative visual information. Experimental results on the VisDial v1.0 dataset show that our method enables Visual Dialogue agents to generate more visually coherent, less repetitive, and more visually informative dialogues than previous methods, according to multiple automatic metrics, a human study, and qualitative analysis.
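To make the idea of explicit states and state-based rewards concrete, the following is a minimal illustrative sketch, not the formulation used in the paper. It assumes, purely for illustration, that the per-round concerning state is a vector of weights in [0, 1] over image regions, that the dialogue-level state is the element-wise maximum of all rounds so far, and that the two rewards measure (a) how little the current round overlaps with the previous one and (b) how much new visual content the current round adds. All function names and reward definitions here are hypothetical.

```python
from typing import List, Sequence, Tuple

Vector = List[float]


def update_cumulative(cumulative: Sequence[float], current: Sequence[float]) -> Vector:
    """Assumed dialogue-level state: element-wise max of what has been covered so far."""
    return [max(c, r) for c, r in zip(cumulative, current)]


def diversity_reward(previous: Sequence[float], current: Sequence[float]) -> float:
    """Hypothetical reward: 1 minus the share of the current round's mass
    that overlaps with the previous round's state."""
    mass = sum(current)
    if mass <= 0:
        return 0.0
    overlap = sum(min(p, c) for p, c in zip(previous, current))
    return 1.0 - overlap / mass


def informativeness_reward(cumulative: Sequence[float], current: Sequence[float]) -> float:
    """Hypothetical reward: gain in covered visual content, normalized by region count."""
    updated = update_cumulative(cumulative, current)
    return (sum(updated) - sum(cumulative)) / len(cumulative)


def score_dialogue(round_states: Sequence[Sequence[float]]) -> List[Tuple[float, float]]:
    """Return one (diversity, informativeness) pair per dialogue round."""
    n_regions = len(round_states[0])
    cumulative: Vector = [0.0] * n_regions
    previous: Vector = [0.0] * n_regions
    rewards: List[Tuple[float, float]] = []
    for current in round_states:
        rewards.append(
            (diversity_reward(previous, current),
             informativeness_reward(cumulative, current))
        )
        cumulative = update_cumulative(cumulative, current)
        previous = list(current)
    return rewards


if __name__ == "__main__":
    # Three rounds over four image regions; round 2 repeats round 1.
    rounds = [
        [1.0, 0.0, 0.0, 0.0],
        [1.0, 0.0, 0.0, 0.0],
        [0.0, 1.0, 1.0, 0.0],
    ]
    for i, (div, info) in enumerate(score_dialogue(rounds), start=1):
        print(f"round {i}: diversity={div:.2f} informativeness={info:.2f}")
```

In this toy setting, the repeated second round receives zero for both rewards, while the third round, which moves to previously uncovered regions, is rewarded on both. Because the states are explicit vectors, each reward can be traced back to specific image regions, which is the kind of interpretability the abstract refers to.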