Visual Dialog is a challenging vision-language task, since the visual dialog agent needs to answer a series of questions after reasoning over both the image content and the dialog history. Although existing methods attempt to handle cross-modal understanding in visual dialog, they still fall short when ranking candidate answers based on their understanding of the visual and textual contexts. In this paper, we analyze cross-modal understanding in visual dialog based on the vision-language pre-trained model VD-BERT and propose a novel approach to improve cross-modal understanding for visual dialog, named ICMU. ICMU enhances cross-modal understanding by distinguishing different pulled inputs (i.e., pulled images, questions, or answers) based on four-way contrastive learning. In addition, ICMU exploits single-turn visual question answering to strengthen the visual dialog model's cross-modal understanding of multi-turn visually-grounded conversations. Experiments show that the proposed approach improves the visual dialog model's cross-modal understanding and brings satisfactory gains on the VisDial dataset.
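The abstract does not spell out how the four-way contrastive objective is implemented, so the following is only a minimal PyTorch sketch of one plausible reading: each training example is expanded into four variants (the original image-question-answer input plus three "pulled" variants in which the image, the question, or the answer is swapped with one from another example), and a small head over the encoder's pooled output is trained to tell the four variants apart. The class name `FourWayContrastiveHead`, the label convention, and the use of a pooled [CLS]-style embedding are all assumptions for illustration, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class FourWayContrastiveHead(nn.Module):
    """Hypothetical four-way contrastive head over pooled cross-modal embeddings.

    Assumed label convention (not from the abstract):
      0: original input, 1: image pulled, 2: question pulled, 3: answer pulled.
    """

    def __init__(self, hidden_size: int = 768, num_classes: int = 4):
        super().__init__()
        self.classifier = nn.Linear(hidden_size, num_classes)
        self.loss_fn = nn.CrossEntropyLoss()

    def forward(self, pooled_embedding: torch.Tensor, variant_label: torch.Tensor) -> torch.Tensor:
        # pooled_embedding: (batch, hidden) pooled encoder output,
        # e.g. the [CLS] state of a VD-BERT-style encoder (assumption).
        # variant_label:   (batch,) integer label of which variant each row is.
        logits = self.classifier(pooled_embedding)
        return self.loss_fn(logits, variant_label)


if __name__ == "__main__":
    head = FourWayContrastiveHead()
    dummy_pooled = torch.randn(8, 768)        # stand-in for encoder outputs
    dummy_labels = torch.randint(0, 4, (8,))  # which variant each row represents
    print(head(dummy_pooled, dummy_labels))   # scalar contrastive loss
```

Under this reading, the head simply adds a cross-entropy term to the pre-training objective, encouraging the encoder to produce representations that separate well-matched image-question-answer inputs from the pulled (mismatched) ones.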