Human conversation is a complex mechanism with subtle nuances. Developing artificial intelligence agents that can participate fluently in a conversation is therefore an ambitious goal. While we are still far from achieving this goal, recent progress in visual question answering, image captioning, and visual question generation suggests that dialog systems may be realizable in the not-too-distant future. To this end, a novel dataset was recently introduced, and encouraging results were demonstrated, particularly for question answering. In this paper, we present a simple symmetric discriminative baseline that can be applied both to predicting an answer and to predicting a question. We show that this method performs on par with the state of the art, including memory-network-based methods. In addition, for the first time on the visual dialog dataset, we assess the performance of a system asking questions, and demonstrate how visual dialog can be generated from discriminative question generation and question answering.