Temporal language grounding (TLG) aims to localize a video segment in an untrimmed video based on a natural language description. To alleviate the high cost of manually annotating temporal boundary labels, we focus on the weakly supervised setting, where only video-level descriptions are provided for training. Most existing weakly supervised methods generate a set of candidate segments and learn cross-modal alignment through a multiple-instance learning (MIL) framework. However, both the temporal structure of the video and the complex semantics of the sentence are lost during such learning. In this work, we propose a novel candidate-free framework, the Fine-grained Semantic Alignment Network (FSAN), for weakly supervised TLG. Instead of treating the sentence and candidate moments as monolithic units, FSAN learns token-by-clip cross-modal semantic alignment via an iterative cross-modal interaction module, generates a fine-grained cross-modal semantic alignment map, and performs grounding directly on top of this map. Extensive experiments on two widely used benchmarks, ActivityNet-Captions and DiDeMo, show that FSAN achieves state-of-the-art performance.
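To make the token-by-clip alignment idea concrete, below is a minimal sketch, not the authors' implementation: it builds a (tokens × clips) map from cosine similarities between token and clip embeddings, then reads out a segment as the longest run of clips whose relevance exceeds a threshold. The function names (`alignment_map`, `ground_from_map`), the max-over-tokens pooling, and the run-based readout are illustrative assumptions; FSAN learns its map through an iterative cross-modal interaction module rather than raw cosine similarity.

```python
# Hypothetical sketch of a token-by-clip alignment map and a simple
# grounding readout; shapes and the thresholding rule are assumptions,
# not the paper's actual method.
import torch
import torch.nn.functional as F

def alignment_map(token_emb: torch.Tensor, clip_emb: torch.Tensor) -> torch.Tensor:
    """token_emb: (T, D) sentence tokens; clip_emb: (C, D) video clips.
    Returns a (T, C) map of cosine similarities."""
    t = F.normalize(token_emb, dim=-1)
    c = F.normalize(clip_emb, dim=-1)
    return t @ c.t()

def ground_from_map(amap: torch.Tensor, thresh: float = 0.0):
    """Pool the (T, C) map over tokens to a per-clip relevance score,
    then return the longest contiguous run of clips above `thresh`
    (an assumed readout rule)."""
    scores = amap.max(dim=0).values          # (C,) per-clip relevance
    active = (scores > thresh).tolist()
    best, cur, start = (0, -1, -1), 0, 0
    for i, a in enumerate(active + [False]): # sentinel ends the last run
        if a:
            if cur == 0:
                start = i
            cur += 1
        else:
            if cur > best[0]:
                best = (cur, start, i - 1)
            cur = 0
    return best[1], best[2]                  # (start_clip, end_clip)

# Example: 6 tokens, 20 clips, 256-d embeddings (random, for illustration)
amap = alignment_map(torch.randn(6, 256), torch.randn(20, 256))
print(ground_from_map(amap))
```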