Grounding language queries in videos aims at identifying the time interval (or moment) semantically relevant to a language query. Solving this challenging task demands understanding the semantic content of both the video and the query, as well as fine-grained reasoning about their multi-modal interactions. Our key idea is to recast this challenge into an algorithmic graph matching problem. Fueled by recent advances in Graph Neural Networks, we propose to leverage Graph Convolutional Networks to model video and textual information as well as their semantic alignment. To enable the mutual exchange of information across the two modalities, we design a novel Video-Language Graph Matching Network (VLG-Net) to match video and query graphs. Core ingredients include representation graphs, built on top of video snippets and query tokens separately, which are used for modeling intra-modality relationships. A Graph Matching layer is adopted for cross-modal context modeling and multi-modal fusion. Finally, moment candidates are generated through masked moment attention pooling, which fuses each moment's enriched snippet features. We demonstrate superior performance over state-of-the-art grounding methods on three widely used datasets for temporal localization of moments in videos with natural language queries: ActivityNet-Captions, TACoS, and DiDeMo.
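The abstract names three main ingredients: intra-modality representation graphs processed with graph convolutions, a Graph Matching layer for cross-modal fusion, and masked moment attention pooling over candidate moments. Below is a minimal, hypothetical PyTorch sketch of how such pieces could fit together; every class name, tensor shape, and hyper-parameter here is an illustrative assumption of ours, not the authors' released VLG-Net implementation.

```python
# Hypothetical sketch only: names, shapes, and the simple dot-product matching are
# illustrative assumptions, not the paper's actual architecture or hyper-parameters.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GraphConv(nn.Module):
    """One graph-convolution step: mix each node with its neighbors via an adjacency matrix."""
    def __init__(self, dim):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, x, adj):
        # x: (B, N, D) node features, adj: (B, N, N) row-normalized adjacency
        return F.relu(self.proj(torch.bmm(adj, x)))


class GraphMatchingLayer(nn.Module):
    """Cross-modal context: video nodes attend over query nodes and vice versa."""
    def __init__(self, dim):
        super().__init__()
        self.scale = dim ** -0.5

    def forward(self, v, q):
        # v: (B, Nv, D) snippet nodes, q: (B, Nq, D) token nodes
        att_vq = torch.softmax(torch.bmm(v, q.transpose(1, 2)) * self.scale, dim=-1)
        att_qv = torch.softmax(torch.bmm(q, v.transpose(1, 2)) * self.scale, dim=-1)
        return v + torch.bmm(att_vq, q), q + torch.bmm(att_qv, v)


class MaskedMomentAttentionPooling(nn.Module):
    """Fuse each candidate moment's enriched snippet features with masked attention."""
    def __init__(self, dim):
        super().__init__()
        self.score = nn.Linear(dim, 1)

    def forward(self, snippets, moment_mask):
        # snippets: (B, Nv, D); moment_mask: (B, M, Nv) with 1 inside each moment span
        logits = self.score(snippets).transpose(1, 2)                 # (B, 1, Nv)
        logits = logits.expand(-1, moment_mask.size(1), -1)           # (B, M, Nv)
        logits = logits.masked_fill(moment_mask == 0, float('-inf'))  # hide out-of-span snippets
        return torch.bmm(torch.softmax(logits, dim=-1), snippets)     # (B, M, D)


# Toy forward pass with random features and stand-in (random) graph adjacencies.
B, Nv, Nq, D = 2, 16, 8, 64
video, query = torch.randn(B, Nv, D), torch.randn(B, Nq, D)
adj_v = torch.softmax(torch.randn(B, Nv, Nv), dim=-1)    # snippet graph (placeholder)
adj_q = torch.softmax(torch.randn(B, Nq, Nq), dim=-1)    # token graph (placeholder)

gcn_v, gcn_q = GraphConv(D), GraphConv(D)
matcher, pooler = GraphMatchingLayer(D), MaskedMomentAttentionPooling(D)

v_ctx, q_ctx = gcn_v(video, adj_v), gcn_q(query, adj_q)  # intra-modality modeling
v_fused, _ = matcher(v_ctx, q_ctx)                       # cross-modal fusion

# Candidate moments as [start, end) snippet spans, turned into binary masks.
starts = torch.tensor([[0, 4, 8, 2], [1, 5, 9, 3]])      # (B, M)
ends = starts + 4
idx = torch.arange(Nv).view(1, 1, Nv)
moment_mask = ((idx >= starts.unsqueeze(-1)) & (idx < ends.unsqueeze(-1))).float()

moment_feats = pooler(v_fused, moment_mask)
print(moment_feats.shape)  # torch.Size([2, 4, 64])
```

In the full model, the snippet and token graphs would presumably carry learned temporal and syntactic edges rather than the random placeholder adjacencies used in this toy example.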