We present our work on the multimodal coreference resolution task of the Situated and Interactive Multimodal Conversation 2.0 (SIMMC 2.0) dataset, carried out as part of the tenth Dialog System Technology Challenge (DSTC10). We propose a UNITER-based model that uses rich multimodal context, such as the textual dialog history, the object knowledge base, and the visual dialog scene, to determine whether each object in the current scene is mentioned in the current dialog turn. The proposed approach substantially outperforms the official DSTC10 baseline, raising the object F1 score from 36.6% to 77.3% on the development set, which demonstrates the effectiveness of the proposed object representations built from rich multimodal input. After model ensembling, our model ranks second in the official evaluation on the object coreference resolution task, with an F1 score of 73.3%.
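To make the reported metric concrete, the following is a minimal sketch of how an object F1 score could be computed when the task is framed as a per-object mentioned/not-mentioned decision. It assumes that predictions and gold labels are given as one set of object IDs per dialog turn and that counts are pooled over all turns before precision and recall are computed; the function name `object_f1` and this input layout are illustrative assumptions, and the official SIMMC 2.0 evaluation script may differ in its details.

```python
from typing import Sequence, Set, Tuple


def object_f1(
    pred_turns: Sequence[Set[int]],
    gold_turns: Sequence[Set[int]],
) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) pooled over all dialog turns.

    Each element of pred_turns / gold_turns is the set of object IDs
    predicted / annotated as mentioned in one turn (hypothetical layout).
    """
    if len(pred_turns) != len(gold_turns):
        raise ValueError("pred_turns and gold_turns must have the same length")

    true_pos = n_pred = n_gold = 0
    for pred, gold in zip(pred_turns, gold_turns):
        true_pos += len(pred & gold)  # objects correctly predicted as mentioned
        n_pred += len(pred)           # all objects predicted as mentioned
        n_gold += len(gold)           # all objects actually mentioned

    # Guard against division by zero when nothing is predicted or annotated.
    precision = true_pos / n_pred if n_pred else 0.0
    recall = true_pos / n_gold if n_gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    # Two toy turns: one spurious object in turn 1, one missed object in turn 2.
    preds = [{1, 2}, {3}]
    golds = [{1}, {3, 4}]
    print(object_f1(preds, golds))
```

In this toy example two of the three predicted objects are correct and two of the three gold objects are recovered, so precision, recall, and F1 all equal 2/3.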