Video grounding aims to localize the corresponding video moment in an untrimmed video given a language query. Existing methods often address this task in an indirect way, by casting it as a proposal-and-match or fusion-and-detection problem. Solving these surrogate problems often requires sophisticated label assignment during training and hand-crafted removal of near-duplicate results. Meanwhile, existing works typically focus on sparse video grounding with a single sentence as input, which could result in ambiguous localization due to the unclear description. In this paper, we tackle a new problem of dense video grounding, by simultaneously localizing multiple moments with a paragraph as input. Viewing video grounding as language-conditioned regression, we present an end-to-end parallel decoding paradigm by re-purposing a Transformer-like architecture (PRVG). The key design of our PRVG is to use language as queries and directly regress the moment boundaries based on language-modulated visual representations. Thanks to its simple design, our PRVG framework can be applied in different testing schemes (sparse or dense grounding) and allows for efficient inference without any post-processing technique. In addition, we devise a robust proposal-level attention loss to guide the training of PRVG, which is invariant to moment duration and contributes to model convergence. We perform experiments on two video grounding benchmarks, ActivityNet Captions and TACoS, demonstrating that our PRVG significantly outperforms previous methods. We also perform in-depth studies to investigate the effectiveness of the parallel regression paradigm for video grounding.
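To make the "language as queries" parallel regression idea more concrete, the following PyTorch sketch illustrates one possible instantiation: sentence-level query embeddings attend to clip-level video features through a Transformer decoder, and an MLP head directly regresses normalized moment boundaries for all sentences in parallel. This is only a minimal sketch under assumed design choices (module names, feature dimensions, and the (center, width) parameterization are hypothetical), not the authors' exact implementation.

```python
# Minimal sketch of language-as-queries parallel regression for video grounding.
# All names, dimensions, and the (center, width) parameterization are assumptions.
import torch
import torch.nn as nn

class ParallelRegressionHead(nn.Module):
    def __init__(self, dim=256, num_heads=8, num_layers=2):
        super().__init__()
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=dim, nhead=num_heads, batch_first=True)
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=num_layers)
        # Regress a normalized (center, width) pair for each sentence query.
        self.boundary_mlp = nn.Sequential(
            nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, 2))

    def forward(self, sentence_queries, video_features):
        # sentence_queries: (B, N_sent, dim) -- one query per sentence of the paragraph
        # video_features:   (B, T_clip, dim) -- encoded clip-level visual features
        # Each sentence query is modulated by the video via cross-attention,
        # then mapped to a moment boundary; all N_sent moments are decoded in parallel.
        modulated = self.decoder(tgt=sentence_queries, memory=video_features)
        center, width = torch.sigmoid(self.boundary_mlp(modulated)).unbind(-1)
        start = (center - 0.5 * width).clamp(0, 1)
        end = (center + 0.5 * width).clamp(0, 1)
        return torch.stack([start, end], dim=-1)  # (B, N_sent, 2) normalized boundaries

# Usage: two sentence queries over a 64-clip video.
head = ParallelRegressionHead()
boundaries = head(torch.randn(1, 2, 256), torch.randn(1, 64, 256))
print(boundaries.shape)  # torch.Size([1, 2, 2])
```

Because every sentence query yields exactly one boundary prediction, no proposal matching or near-duplicate removal (e.g., NMS) is needed at inference time, which is the property the abstract refers to as post-processing-free inference.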