We address Natural Language Video Localization (NLVL): localizing the video segment that corresponds to a natural language description in a long, untrimmed video. State-of-the-art NLVL methods are almost all one-stage and typically fall into two categories: 1) anchor-based approaches, which first pre-define a series of candidate video segments (e.g., by sliding window) and then classify each candidate; and 2) anchor-free approaches, which directly predict, for each video frame, the probability that it is a boundary frame or an intermediate frame inside the positive segment. However, both kinds of one-stage approach have inherent drawbacks: the anchor-based approach is sensitive to the heuristic rules used to define candidates, which limits its ability to handle videos of varying length, while the anchor-free approach fails to exploit segment-level interaction and therefore achieves inferior results. In this paper, we propose the Boundary Proposal Network (BPNet), a universal two-stage framework that avoids both issues. In the first stage, BPNet uses an anchor-free model to generate a group of high-quality candidate video segments together with their boundaries. In the second stage, a visual-language fusion layer jointly models the multi-modal interaction between each candidate and the language query, followed by a matching score rating layer that outputs an alignment score for each candidate. We evaluate BPNet on three challenging NLVL benchmarks (Charades-STA, TACoS, and ActivityNet-Captions). Extensive experiments and ablation studies on these datasets demonstrate that BPNet outperforms state-of-the-art methods.
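The two-stage idea can be illustrated with a minimal Python sketch. This is not the BPNet implementation: the function names (`propose_segments`, `rerank`), the product of start and end probabilities as a proposal score, and the use of mean pooling with cosine similarity in place of the learned visual-language fusion and matching score rating layers are all assumptions made for illustration only.

```python
import math
from typing import List, Sequence, Tuple

Segment = Tuple[int, int]  # (start_frame, end_frame), both inclusive


def propose_segments(start_p: Sequence[float], end_p: Sequence[float],
                     top_k: int = 3) -> List[Segment]:
    """Stage 1 (sketch): turn per-frame boundary probabilities, as an
    anchor-free model might output them, into the top_k candidate segments.
    Scoring a pair by start_p[s] * end_p[e] is an assumption."""
    n = len(start_p)
    scored = [(start_p[s] * end_p[e], s, e)
              for s in range(n) for e in range(s, n)]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [(s, e) for _, s, e in scored[:top_k]]


def mean_pool(frames: Sequence[Sequence[float]], seg: Segment) -> List[float]:
    """Average the frame features inside a segment."""
    s, e = seg
    chunk = frames[s:e + 1]
    return [sum(f[d] for f in chunk) / len(chunk) for d in range(len(chunk[0]))]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def rerank(frames: Sequence[Sequence[float]], query: Sequence[float],
           candidates: Sequence[Segment]) -> List[Segment]:
    """Stage 2 (sketch): score each candidate against the query embedding
    and sort by that score. Cosine similarity is a hypothetical stand-in
    for a learned fusion and matching layer."""
    scored = [(cosine(mean_pool(frames, seg), query), seg) for seg in candidates]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [seg for _, seg in scored]


# Toy data: 5 frames with 2-dimensional features and a 2-dimensional query.
frames = [[1.0, 0.0], [1.0, 0.0], [0.2, 0.8], [0.0, 1.0], [1.0, 0.0]]
start_p = [0.1, 0.1, 0.6, 0.1, 0.1]
end_p = [0.1, 0.1, 0.5, 0.3, 0.1]
query = [0.0, 1.0]

candidates = propose_segments(start_p, end_p, top_k=3)
ranked = rerank(frames, query, candidates)
print("stage 1 candidates:", candidates)
print("stage 2 ranking:   ", ranked)
```

In this toy example the first stage prefers the single-frame segment (2, 2), while the segment-level score in the second stage promotes (2, 3). That is the kind of correction a segment-level matching step is meant to provide over purely frame-level boundary predictions.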