Visual dialog is the task of answering a sequence of questions grounded in an image, using the previous dialog history as context. In this paper, we study how to address two fundamental challenges for this task: (1) reasoning over the underlying semantic structures among dialog rounds and (2) identifying several appropriate answers to a given question. To address these challenges, we propose a Sparse Graph Learning (SGL) method that formulates visual dialog as a graph structure learning task. SGL infers inherently sparse dialog structures by incorporating binary and score edges and by leveraging a new structural loss function. Next, we introduce a Knowledge Transfer (KT) method that extracts answer predictions from a teacher model and uses them as pseudo labels. We propose KT to remedy the shortcomings of single ground-truth labels, which severely limit a model's ability to obtain multiple reasonable answers. As a result, our proposed model significantly improves reasoning capability compared to baseline methods and outperforms state-of-the-art approaches on the VisDial v1.0 dataset. The source code is available at https://github.com/gicheonkang/SGLKT-VisDial.
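To make the Knowledge Transfer idea concrete, here is a minimal sketch of how a teacher model's answer distribution could be blended with the one-hot ground-truth label to form a soft training target, in the style of standard knowledge distillation. The function name, the `alpha` blending weight, and the plain cross-entropy objective are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over a 1-D logit vector.
    e = np.exp(x - x.max())
    return e / e.sum()

def kt_loss(student_logits, teacher_logits, gt_index, alpha=0.5):
    """Cross-entropy of the student's answer distribution against a
    soft target: (1 - alpha) * one-hot ground truth + alpha * teacher
    pseudo labels. alpha=0 recovers the usual single-label loss."""
    p_student = softmax(student_logits)
    p_teacher = softmax(teacher_logits)      # teacher's pseudo labels
    one_hot = np.zeros_like(p_teacher)
    one_hot[gt_index] = 1.0                  # single ground-truth answer
    target = (1.0 - alpha) * one_hot + alpha * p_teacher
    return -np.sum(target * np.log(p_student + 1e-12))
```

With `alpha > 0`, candidate answers the teacher rates highly receive non-zero target mass even though only one answer is annotated, which is the intuition behind using teacher predictions to supply multiple reasonable answers.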