We present a new paradigm for video grounding, named explore-and-match, which aims to seamlessly unify the two main streams of video grounding methods: proposal-based and proposal-free. To this end, we formulate video grounding as a set prediction problem and design an end-to-end trainable Video Grounding Transformer (VidGTR) that exploits the architectural strengths of rich contextualization and parallel decoding for set prediction. Training is balanced by two key losses that play complementary roles: a span localization loss and a set guidance loss. These losses force each proposal to regress the target timespan and to identify the target query, respectively. During training, VidGTR first explores the search space to diversify the initial proposals and then matches the proposals to their corresponding targets, refining them in a fine-grained manner. This explore-and-match scheme successfully combines the strengths of the two complementary method families without encoding prior knowledge into the pipeline. As a result, VidGTR sets new state-of-the-art results on two video grounding benchmarks while doubling the inference speed.
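To make the set prediction formulation concrete, the following is a minimal sketch of how a span localization loss and a set guidance loss could be combined after bipartite matching of proposals to targets. It is an illustrative assumption, not the paper's actual implementation: the (center, width) span parameterization, the weights SPAN_WEIGHT and GUIDE_WEIGHT, and the function explore_and_match_loss are all hypothetical names introduced here for clarity.

```python
# Hedged sketch (assumed details): combine a span localization loss (L1 on
# normalized [center, width] spans) with a set guidance (classification) loss
# after Hungarian matching of proposals to ground-truth moments.
import torch
import torch.nn.functional as F
from scipy.optimize import linear_sum_assignment

SPAN_WEIGHT = 5.0    # assumed relative weight of the span localization loss
GUIDE_WEIGHT = 1.0   # assumed relative weight of the set guidance loss


def explore_and_match_loss(pred_spans, pred_logits, gt_spans, gt_labels):
    """Match predicted proposals to target moments, then compute both losses.

    pred_spans:  (num_queries, 2) normalized (center, width) predictions
    pred_logits: (num_queries, num_classes) query-identification logits
    gt_spans:    (num_targets, 2) normalized ground-truth spans
    gt_labels:   (num_targets,) index of the sentence query each span answers
    """
    # Pairwise matching cost: span distance plus negative class probability.
    span_cost = torch.cdist(pred_spans, gt_spans, p=1)          # (Q, T)
    prob = pred_logits.softmax(-1)                               # (Q, C)
    class_cost = -prob[:, gt_labels]                             # (Q, T)
    cost = SPAN_WEIGHT * span_cost + GUIDE_WEIGHT * class_cost

    # Hungarian matching assigns each target moment to one proposal.
    row, col = linear_sum_assignment(cost.detach().cpu().numpy())
    row, col = torch.as_tensor(row), torch.as_tensor(col)

    # Span localization loss: matched proposals regress the target timespan.
    span_loss = F.l1_loss(pred_spans[row], gt_spans[col])

    # Set guidance loss: matched proposals must identify the target query.
    guide_loss = F.cross_entropy(pred_logits[row], gt_labels[col])

    return SPAN_WEIGHT * span_loss + GUIDE_WEIGHT * guide_loss


if __name__ == "__main__":
    torch.manual_seed(0)
    loss = explore_and_match_loss(
        pred_spans=torch.rand(10, 2),
        pred_logits=torch.randn(10, 4),
        gt_spans=torch.rand(3, 2),
        gt_labels=torch.tensor([0, 2, 3]),
    )
    print(float(loss))
```

Under this reading, the unmatched proposals are free to explore the search space early in training, while the matched ones are pulled toward their assigned targets, which mirrors the explore-then-match behavior described above.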