Video-and-Language Inference is a recently proposed task for joint video-and-language understanding. This new task requires a model to infer whether a natural language statement is entailed by or contradicts a given video clip. In this paper, we study how to address three critical challenges for this task: judging the global correctness of a statement that involves multiple semantic meanings, reasoning jointly over video and subtitles, and modeling long-range relationships and complex social interactions. First, we propose an adaptive hierarchical graph network that achieves in-depth understanding of videos with complex interactions. Specifically, it performs joint reasoning over video and subtitles at three hierarchies, where the graph structure is adaptively adjusted according to the semantic structure of the statement. Second, we introduce semantic coherence learning to explicitly encourage semantic coherence across the three hierarchies of the adaptive hierarchical graph network. Semantic coherence learning further improves the alignment between vision and language, as well as the coherence across a sequence of video segments. Experimental results show that our method outperforms the baseline by a large margin.
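To make the idea of statement-conditioned graph reasoning concrete, the following is a minimal, hypothetical sketch of one reasoning step at a single hierarchy: edge weights over video/subtitle nodes are derived from the statement encoding, so that nodes relevant to the statement exchange stronger messages. The class name `AdaptiveGraphLayer`, the feature dimensions, and the specific scoring function are illustrative assumptions, not the paper's actual architecture or training objective.

```python
# Hypothetical sketch (not the authors' code): one statement-conditioned
# graph-reasoning step at a single hierarchy of the proposed model.
import torch
import torch.nn as nn


class AdaptiveGraphLayer(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.key = nn.Linear(dim, dim)      # projects node features for edge scoring
        self.query = nn.Linear(dim, dim)    # projects the statement for edge scoring
        self.update = nn.Linear(dim, dim)   # updates node features after message passing

    def forward(self, nodes, statement):
        # nodes:     (num_nodes, dim) fused video/subtitle features at one hierarchy
        # statement: (dim,)           encoding of the natural language statement
        k = self.key(nodes)                               # (num_nodes, dim)
        q = self.query(statement)                         # (dim,)
        # Statement-conditioned edge weights: the graph structure adapts to the
        # semantics of the statement before messages are aggregated.
        scores = k @ (nodes * q).T                        # (num_nodes, num_nodes)
        adj = torch.softmax(scores, dim=-1)
        messages = adj @ nodes                            # aggregate neighbor features
        return torch.relu(self.update(messages) + nodes)  # residual node update


# Toy usage with random features (e.g., 8 segment-level nodes, dim 256).
layer = AdaptiveGraphLayer(256)
out = layer(torch.randn(8, 256), torch.randn(256))
print(out.shape)  # torch.Size([8, 256])
```

In a full model of this kind, such a layer would be applied at each of the three hierarchies (e.g., frame, segment, and clip level), with an additional objective encouraging the node representations at different hierarchies to remain semantically coherent.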