Grounding language queries in videos aims at identifying the time interval (or moment) semantically relevant to a language query. Solving this challenging task requires understanding the semantic content of both videos and queries, as well as fine-grained reasoning about their multi-modal interactions. Our key idea is to recast this challenge as an algorithmic graph matching problem. Fueled by recent advances in Graph Neural Networks, we propose to leverage Graph Convolutional Networks to model video and textual information as well as their semantic alignment. To enable the mutual exchange of information across the modalities, we design a novel Video-Language Graph Matching Network (VLG-Net) to match video and query graphs. Core ingredients include representation graphs, built separately atop video snippets and query tokens, that model intra-modality relationships. A Graph Matching layer is adopted for cross-modal context modeling and multi-modal fusion. Finally, moment candidates are created using masked moment attention pooling, which fuses the moment's enriched snippet features. We demonstrate superior performance over state-of-the-art grounding methods on three widely used datasets for temporal localization of moments in videos with language queries: ActivityNet-Captions, TACoS, and DiDeMo.
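To make the final pooling step concrete, below is a minimal PyTorch sketch of masked moment attention pooling under assumed shapes. The module name `MaskedMomentPooling`, the single-linear scoring head, and all dimensions are hypothetical illustrations of the general technique, not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MaskedMomentPooling(nn.Module):
    """Pool snippet features inside a candidate moment via masked attention.

    Hypothetical sketch: scores each snippet, masks out snippets outside the
    candidate moment, and fuses the rest into one moment representation.
    """
    def __init__(self, dim: int):
        super().__init__()
        self.score = nn.Linear(dim, 1)  # per-snippet attention logit (assumed head)

    def forward(self, snippets: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
        # snippets: (B, T, D) enriched snippet features
        # mask:     (B, T), 1 for snippets inside the candidate moment, else 0
        logits = self.score(snippets).squeeze(-1)            # (B, T)
        logits = logits.masked_fill(mask == 0, float("-inf"))
        attn = F.softmax(logits, dim=-1)                     # attention over in-moment snippets
        return torch.einsum("bt,btd->bd", attn, snippets)    # (B, D) moment feature

# Usage: pool a candidate moment spanning snippets [2, 5) out of T = 8
feats = torch.randn(1, 8, 256)
mask = torch.zeros(1, 8)
mask[:, 2:5] = 1
moment = MaskedMomentPooling(256)(feats, mask)  # -> shape (1, 256)
```

Masking the logits with negative infinity before the softmax guarantees that snippets outside the moment receive exactly zero attention weight, so each candidate's representation depends only on its own span.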