Video-grounded dialogue systems aim to integrate video understanding and dialogue understanding to generate responses that are relevant to both the dialogue and video context. Most existing approaches employ deep learning models and have achieved remarkable performance, given the relatively small datasets available. However, the results are partly accomplished by exploiting biases in the datasets rather than developing multimodal reasoning, resulting in limited generalization. In this paper, we propose a novel approach, Compositional Counterfactual Contrastive Learning ($C^3$), to develop contrastive training between factual and counterfactual samples in video-grounded dialogues. Specifically, we design factual/counterfactual sampling based on the temporal steps in videos and the tokens in dialogues, and propose contrastive loss functions that exploit object-level or action-level variance. Different from prior approaches, we focus on contrasting hidden state representations among compositional output tokens to optimize the representation space in a generation setting. We achieve promising performance gains on the Audio Visual Scene-Aware Dialog (AVSD) benchmark and show the benefits of our approach in grounding video and dialogue context.
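To make the training objective more concrete, below is a minimal, hypothetical PyTorch sketch of the kind of contrastive loss the abstract describes: decoder hidden states of the output tokens are pooled under factual and counterfactual input variants, and an InfoNCE-style objective pulls the factual (positive) representation toward the anchor while pushing the counterfactual (negative) representation away. The function names (`pool_states`, `c3_contrastive_loss`), the mean-pooling choice, and the temperature value are illustrative assumptions, not the authors' implementation.

```python
# A minimal, hypothetical sketch of the contrastive objective described above.
# Names and hyperparameters are assumptions for illustration, not the paper's code.
import torch
import torch.nn.functional as F


def pool_states(hidden: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    """Mean-pool hidden states over the tokens selected by `mask`.

    hidden: (batch, seq_len, dim) hidden states of the output tokens.
    mask:   (batch, seq_len), 1 for tokens to include (e.g., object/action tokens).
    """
    mask = mask.unsqueeze(-1).float()
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1.0)


def c3_contrastive_loss(anchor: torch.Tensor,
                        positive: torch.Tensor,
                        negative: torch.Tensor,
                        temperature: float = 0.1) -> torch.Tensor:
    """InfoNCE-style loss: pull the factual (positive) representation toward
    the anchor, push the counterfactual (negative) representation away.

    anchor, positive, negative: (batch, dim) pooled hidden states from the
    original, factual, and counterfactual input variants, respectively.
    """
    anchor = F.normalize(anchor, dim=-1)
    positive = F.normalize(positive, dim=-1)
    negative = F.normalize(negative, dim=-1)
    pos_sim = (anchor * positive).sum(-1) / temperature   # (batch,)
    neg_sim = (anchor * negative).sum(-1) / temperature   # (batch,)
    logits = torch.stack([pos_sim, neg_sim], dim=1)       # (batch, 2)
    # The positive pair sits at index 0 of each row of logits.
    labels = torch.zeros(anchor.size(0), dtype=torch.long, device=anchor.device)
    return F.cross_entropy(logits, labels)


# Example usage with random tensors standing in for decoder states:
B, T, D = 4, 16, 256
h_fact, h_pos, h_cf = (torch.randn(B, T, D) for _ in range(3))
m = torch.ones(B, T)
loss = c3_contrastive_loss(pool_states(h_fact, m),
                           pool_states(h_pos, m),
                           pool_states(h_cf, m))
```

In a generation setting, this auxiliary loss would be added to the standard token-level cross-entropy of the response decoder, so that the hidden states remain discriminative with respect to object- or action-level changes in the video and dialogue inputs.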