Designed to track user goals in dialogues, a dialogue state tracker is an essential component of a dialogue system. However, research on dialogue state tracking has largely been limited to unimodal settings, in which slots and slot values are constrained by the knowledge domain (e.g., the restaurant domain, with slots for restaurant name and price range) and are defined by a specific database schema. In this paper, we propose to extend the definition of dialogue state tracking to multimodality. Specifically, we introduce a novel dialogue state tracking task that tracks the information of visual objects mentioned in video-grounded dialogues. Each new dialogue utterance may introduce a new video segment, new visual objects, or new object attributes, and a state tracker is required to update these information slots accordingly. We created a new synthetic benchmark and designed a novel baseline, Video-Dialogue Transformer Network (VDTN), for this task. VDTN combines object-level and segment-level features and learns contextual dependencies between videos and dialogues to generate multimodal dialogue states. We optimized VDTN for a state generation task as well as a self-supervised video understanding task that recovers video segment or object representations. Finally, we trained VDTN to use the decoded states in a response prediction task. Together with comprehensive ablations and qualitative analysis, we uncovered insights toward building more capable multimodal dialogue systems.
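To make the task definition concrete, here is a minimal sketch of how such a multimodal dialogue state might be represented and updated turn by turn. The slot names, the (start, end) segment encoding, and the carry-over-and-overwrite update rule are illustrative assumptions for exposition, not the benchmark's actual schema or VDTN's internal representation.

```python
# Hypothetical sketch of a multimodal dialogue state: a video segment slot
# plus per-object attribute slots, updated as each utterance adds new
# segments, objects, or attributes. Slot names are assumptions, not the
# paper's schema.
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class MultimodalDialogueState:
    """Tracks the current video segment and visual-object slots."""
    segment: Optional[Tuple[int, int]] = None  # assumed (start, end) frame span
    objects: Dict[int, Dict[str, str]] = field(default_factory=dict)

    def update(self, turn: "MultimodalDialogueState") -> None:
        """Carry over prior slots; overwrite or add values from the new turn."""
        if turn.segment is not None:
            self.segment = turn.segment
        for obj_id, attrs in turn.objects.items():
            self.objects.setdefault(obj_id, {}).update(attrs)


# Example: turn 1 mentions a red cube in frames 0-20; turn 2 adds its size
# and introduces a second object, so the tracker accumulates both.
state = MultimodalDialogueState()
state.update(MultimodalDialogueState(
    segment=(0, 20),
    objects={1: {"color": "red", "shape": "cube"}}))
state.update(MultimodalDialogueState(
    objects={1: {"size": "large"}, 2: {"shape": "sphere"}}))
print(state.segment)  # (0, 20)
print(state.objects)
# {1: {'color': 'red', 'shape': 'cube', 'size': 'large'}, 2: {'shape': 'sphere'}}
```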