This paper addresses the problem of natural language video localization (NLVL). Almost all existing works follow the "only look once" framework, which exploits a single model to directly capture the complex cross- and self-modal relations among video-query pairs and retrieve the relevant segment. However, we argue that these methods overlook two indispensable characteristics of an ideal localization method: 1) Frame-differentiable: given the imbalance of positive/negative video frames, it is effective to highlight positive frames and weaken negative ones during localization. 2) Boundary-precise: to predict the exact segment boundary, the model should capture fine-grained differences between consecutive frames, since their variations are often smooth. To this end, inspired by how humans perceive and localize a segment, we propose a two-step human-like framework called Skimming-Locating-Perusing (SLP). SLP consists of a Skimming-and-Locating (SL) module and a Bi-directional Perusing (BP) module. The SL module first refers to the query semantics and selects the best-matched frame from the video while filtering out irrelevant frames. Then, the BP module constructs an initial segment based on this frame and dynamically updates it by exploring its adjacent frames until no frame shares the same activity semantics. Experimental results on three challenging benchmarks show that our SLP is superior to state-of-the-art methods and localizes more precise segment boundaries.
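The two-step procedure described above can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: `frame_scores` stands in for query-conditioned frame relevance scores produced by some cross-modal encoder, and `tau` is a hypothetical threshold deciding whether an adjacent frame still shares the anchor frame's activity semantics.

```python
def slp_localize(frame_scores, tau=0.5):
    """Hedged sketch of Skimming-Locating-Perusing (SLP) localization.

    frame_scores: per-frame relevance scores w.r.t. the query (assumed given).
    tau: hypothetical semantic-relevance threshold for segment expansion.
    Returns (start, end) frame indices of the localized segment, inclusive.
    """
    # Skimming-and-Locating (SL): pick the best-matched frame as the anchor,
    # implicitly filtering out irrelevant frames.
    anchor = max(range(len(frame_scores)), key=lambda i: frame_scores[i])

    # Bi-directional Perusing (BP): grow the segment around the anchor while
    # neighboring frames remain semantically relevant, stopping once an
    # adjacent frame no longer shares the activity semantics.
    start = end = anchor
    while start > 0 and frame_scores[start - 1] >= tau:
        start -= 1
    while end < len(frame_scores) - 1 and frame_scores[end + 1] >= tau:
        end += 1
    return start, end


# Example: relevance peaks around frames 2-4.
scores = [0.1, 0.2, 0.7, 0.9, 0.6, 0.3]
print(slp_localize(scores))  # (2, 4)
```

The bidirectional expansion is what makes the method boundary-precise: instead of regressing boundaries in one shot, the segment is refined frame by frame from a high-confidence anchor.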