Temporal sentence localization in videos (TSLV) aims to retrieve the segment of an untrimmed video that best matches a given sentence query. However, almost all existing TSLV approaches suffer from the same limitations: (1) they focus on either frame-level or object-level visual representation learning and the corresponding correlation reasoning, but fail to integrate the two; (2) they neglect the rich semantic contexts that could further benefit query reasoning. To address these issues, in this paper we propose a novel Hierarchical Visual- and Semantic-Aware Reasoning Network (HVSARN), which enables both visual- and semantic-aware query reasoning from the object level to the frame level. Specifically, we present a new graph memory mechanism to perform visual-semantic query reasoning: for visual reasoning, we design a visual graph memory that exploits the visual information of the video; for semantic reasoning, a semantic graph memory is introduced to explicitly leverage the semantic knowledge contained in the classes and attributes of video objects and to perform correlation reasoning in the semantic space. Experiments on three datasets demonstrate that our HVSARN achieves new state-of-the-art performance.
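The abstract only names the graph memory mechanism without giving its equations; as a rough illustration, the following is a minimal, hypothetical sketch of what one such graph memory module could look like, assuming an attention-based message-passing step followed by a gated write. The module name `GraphMemory`, the PyTorch framing, and all layer choices below are our assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a graph memory module; the paper does not
# specify its equations, so this only illustrates the general idea.
import torch
import torch.nn as nn


class GraphMemory(nn.Module):
    """Nodes exchange messages through learned pairwise affinities,
    then a gated (GRU-style) write updates each node's memory state."""

    def __init__(self, dim: int):
        super().__init__()
        self.query = nn.Linear(dim, dim)
        self.key = nn.Linear(dim, dim)
        self.value = nn.Linear(dim, dim)
        self.write = nn.GRUCell(dim, dim)  # gated memory update

    def forward(self, nodes: torch.Tensor) -> torch.Tensor:
        # nodes: (N, dim) -- e.g. object features (visual memory)
        # or class/attribute embeddings (semantic memory).
        affinity = torch.softmax(
            self.query(nodes) @ self.key(nodes).T / nodes.size(-1) ** 0.5,
            dim=-1,
        )
        message = affinity @ self.value(nodes)  # aggregate neighbor info
        return self.write(message, nodes)       # write back into memory


# Example: separate memories for visual and semantic reasoning.
visual_mem = GraphMemory(256)
semantic_mem = GraphMemory(256)
object_feats = torch.randn(20, 256)   # 20 object-level nodes
semantic_embs = torch.randn(35, 256)  # class/attribute embeddings
v = visual_mem(object_feats)
s = semantic_mem(semantic_embs)
```

In this sketch the same module is instantiated twice, once over object-level visual features and once over class/attribute embeddings, mirroring the paper's separation of visual and semantic graph memories; how the two are actually fused across the object-to-frame hierarchy is not specified here.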