Natural Language Video Grounding (NLVG) aims to localize the time segments in an untrimmed video that correspond to given sentence queries. In this work, we present a new paradigm for NLVG, named Explore-And-Match, that seamlessly unifies the strengths of the two existing streams of NLVG methods: proposal-free methods explore the search space to find time segments directly, while proposal-based methods match predefined time segments against the ground truth. To achieve this, we formulate NLVG as a set prediction problem and design an end-to-end trainable Language Video Transformer (LVTR) that enjoys two favorable properties: rich contextualization power and parallel decoding. We train LVTR with two losses. First, a temporal localization loss lets the time segments of all queries regress toward their targets (explore). Second, a set guidance loss couples each query with its respective target (match). To our surprise, we found that the training schedule exhibits a divide-and-conquer-like pattern: time segments are first diversified regardless of the targets, then coupled with their respective targets, and finally fine-tuned to those targets. Moreover, LVTR is highly efficient and effective: it infers more than 2x faster than previous baselines and achieves competitive results on two NLVG benchmarks (ActivityNet Captions and Charades-STA). Code is available at https://github.com/sangminwoo/Explore-And-Match.
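To make the two losses concrete, below is a minimal sketch of one plausible set-prediction formulation in the DETR style: Hungarian matching (via scipy's `linear_sum_assignment`) couples each prediction with a target, a cross-entropy term enforces that coupling (the set guidance / "match" loss), and an L1 term regresses matched segments toward their targets (the temporal localization / "explore" loss). All tensor shapes, names, and cost terms here are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment


def set_prediction_loss(pred_spans, pred_logits, gt_spans, gt_labels):
    """Sketch of the combined 'explore' and 'match' objectives.

    pred_spans:  (N, 2) predicted segments as (center, width), normalized to [0, 1]
    pred_logits: (N, C) classification logits over C sentence-query slots
    gt_spans:    (M, 2) ground-truth segments in the same format
    gt_labels:   (M,)   sentence-query index of each ground-truth segment
    """
    # Pairwise matching cost: L1 span distance minus the probability
    # assigned to the correct sentence-query slot.
    prob = pred_logits.softmax(-1)                       # (N, C)
    cost_span = torch.cdist(pred_spans, gt_spans, p=1)   # (N, M)
    cost_cls = -prob[:, gt_labels]                       # (N, M)
    cost = (cost_span + cost_cls).detach().cpu().numpy()

    # Hungarian matching: couple one prediction to each ground-truth segment.
    row_ind, col_ind = linear_sum_assignment(cost)
    idx_pred = torch.as_tensor(row_ind, dtype=torch.long)
    idx_gt = torch.as_tensor(col_ind, dtype=torch.long)

    # Temporal localization loss (explore): regress matched segments.
    loc_loss = F.l1_loss(pred_spans[idx_pred], gt_spans[idx_gt])

    # Set guidance loss (match): matched predictions must classify
    # as the sentence query of their coupled target.
    guide_loss = F.cross_entropy(pred_logits[idx_pred], gt_labels[idx_gt])
    return loc_loss + guide_loss


# Toy usage: 10 segment predictions, 3 sentence queries, 3 targets.
pred_spans = torch.rand(10, 2)
pred_logits = torch.randn(10, 3)
gt_spans = torch.rand(3, 2)
gt_labels = torch.tensor([0, 1, 2])
loss = set_prediction_loss(pred_spans, pred_logits, gt_spans, gt_labels)
```

Because the matching is recomputed every step while the regression only acts on matched pairs, a setup like this can plausibly produce the reported divide-and-conquer behavior: predictions first spread out to cover the search space, then lock onto individual targets, then refine.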