Existing multimodal conversation agents have shown impressive abilities to locate absolute positions or retrieve attributes in simple scenarios, but they fail to perform well when complex relative positions and information alignments are involved, which bottlenecks response quality. In this paper, we propose a Situated conversation agent Pretrained with multimodal questions from an INcremental layout Graph (SPRING), which can reason over multi-hop spatial relations and connect them with visual attributes in crowded situated scenarios. Specifically, we design two types of Multimodal Question Answering (MQA) tasks to pretrain the agent. All QA pairs used during pretraining are generated from novel Incremental Layout Graphs (ILG), and the difficulty labels that the ILG automatically annotates on each QA pair drive MQA-based curriculum learning. Experimental results verify SPRING's effectiveness, showing that it significantly outperforms state-of-the-art approaches on both the SIMMC 1.0 and SIMMC 2.0 datasets.
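The curriculum-learning idea described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `QAPair` class, the use of spatial-relation hop count as the difficulty label, and the example questions are all assumptions for demonstration; the only grounded idea is ordering ILG-annotated QA pairs from easy to hard during pretraining.

```python
# Hypothetical sketch of MQA-based curriculum learning: each QA pair
# carries a difficulty label (here assumed to be the number of spatial-
# relation hops the layout graph annotates), and pretraining batches are
# served in easy-to-hard order.
from dataclasses import dataclass


@dataclass
class QAPair:           # hypothetical container, not from the paper
    question: str
    answer: str
    difficulty: int     # e.g. hop count of the spatial-relation chain


def curriculum_batches(pairs, batch_size):
    """Yield batches of QA pairs sorted from easiest to hardest."""
    ordered = sorted(pairs, key=lambda p: p.difficulty)
    for i in range(0, len(ordered), batch_size):
        yield ordered[i:i + batch_size]


# Toy data: a 1-hop attribute query, a 2-hop and a 3-hop relation query.
pairs = [
    QAPair("What colour is the jacket left of the blue coat?", "red", 2),
    QAPair("What is on the top shelf?", "a hat", 1),
    QAPair("What brand is the shirt right of the bag under the lamp?", "X", 3),
]
batches = list(curriculum_batches(pairs, batch_size=2))
# The first batch now holds the easiest (fewest-hop) questions.
```

In a real training loop the schedule would typically be staged (easy pairs for the first epochs, harder pairs mixed in later) rather than a single sorted pass, but the ordering primitive is the same.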