This paper tackles the problem of temporal language localization in videos, which aims to identify the start and end points of the moment described by a natural language sentence in an untrimmed video. The task is non-trivial, since it requires not only a comprehensive understanding of both the video and the sentence query, but also an accurate capture of the semantic correspondence between them. Existing efforts mainly explore the sequential relations among video clips and query words to reason over the video and the sentence query, neglecting other intra-modal relations (e.g., semantic similarity among video clips and syntactic dependency among query words). Toward this end, we propose a Multi-modal Interaction Graph Convolutional Network (MIGCN), which jointly exploits the complex intra-modal relations and inter-modal interactions residing in the video and sentence query to facilitate their understanding and the capture of their semantic correspondence. In addition, we devise an adaptive context-aware localization method, in which context information is incorporated into the candidate moments and multi-scale fully connected layers are designed to rank the generated coarse candidate moments of different lengths and adjust their boundaries. Extensive experiments on the Charades-STA and ActivityNet datasets demonstrate the promising performance and superior efficiency of our model.
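To make the core idea concrete, the sketch below illustrates one plausible form of the joint graph convolution the abstract alludes to: video clips and query words are treated as nodes of a single graph whose adjacency mixes intra-modal edges (e.g., semantic similarity among clips, syntactic dependency among words) with inter-modal clip-word edges. This is a minimal illustrative sketch under assumed shapes and names, not the authors' implementation.

```python
# Minimal sketch of a joint video-query graph convolution (illustrative only;
# all module names, shapes, and adjacency constructions are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F


class JointGraphConv(nn.Module):
    """One graph-convolution step over the joint clip-word graph."""

    def __init__(self, dim: int):
        super().__init__()
        self.proj = nn.Linear(dim, dim)

    def forward(self, nodes: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # nodes: (N_clips + N_words, dim); adj: (N, N) mixing intra- and inter-modal edges
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1e-6)
        msg = (adj / deg) @ nodes              # normalized neighborhood aggregation
        return F.relu(self.proj(msg)) + nodes  # residual keeps original node features


# Toy usage: 8 video clips and 5 query words with 256-d features.
clips, words = torch.randn(8, 256), torch.randn(5, 256)
nodes = torch.cat([clips, words], dim=0)

# Assumed adjacency: similarity-based intra-modal edges among clips, a placeholder
# for syntactic-dependency edges among words, and dense inter-modal clip-word edges.
intra_v = torch.sigmoid(clips @ clips.t())
intra_w = torch.eye(5)
inter = torch.ones(8, 5)
adj = torch.cat([torch.cat([intra_v, inter], dim=1),
                 torch.cat([inter.t(), intra_w], dim=1)], dim=0)

out = JointGraphConv(256)(nodes, adj)          # (13, 256) updated node features
```

In the full model, the updated node features would then feed the adaptive context-aware localization head, which ranks candidate moments of different lengths and refines their boundaries.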