We propose LocFormer, a Transformer-based model for video grounding that operates with a constant memory footprint regardless of the video length, i.e., the number of frames. LocFormer is designed for tasks that require processing the entire long video, and at its core lie two main contributions. First, our model incorporates a new sampling technique that splits the input feature sequence into a fixed number of sections and selects a single feature per section using a stochastic approach. This yields a feature sample set that is representative of the video content for the task at hand while keeping the memory footprint constant. Second, we propose a modular design that separates functionality, enabling us to learn an inductive bias by supervising the self-attention heads while also effectively leveraging pre-trained text and video encoders. We evaluate our proposals on relevant benchmark datasets for video grounding, showing not only that LocFormer achieves excellent results, including state-of-the-art performance on YouCookII, but also that our sampling technique is more effective than competing counterparts and consistently improves the performance of prior work, by up to 3.13\% in mean temporal IoU, ultimately leading to a new state-of-the-art performance on Charades-STA.
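The abstract does not spell out the sampling procedure in detail; the following is a minimal sketch of one plausible reading, assuming frame-level features of shape (T, D), contiguous sections of roughly equal length, and uniform random selection within each section. The function name `stratified_sample` and the parameter `num_sections` are illustrative, not taken from the paper.

```python
import torch


def stratified_sample(features: torch.Tensor, num_sections: int) -> torch.Tensor:
    """Split a (T, D) feature sequence into `num_sections` contiguous sections
    and draw one feature uniformly at random from each section, producing a
    fixed-size (num_sections, D) sample regardless of T.

    Assumes T >= num_sections so that every section is non-empty.
    """
    T = features.shape[0]
    # Section boundaries: section i covers indices [bounds[i], bounds[i + 1]).
    bounds = torch.linspace(0, T, num_sections + 1).long()
    idx = torch.stack([
        torch.randint(int(bounds[i]), int(bounds[i + 1]), (1,)).squeeze(0)
        for i in range(num_sections)
    ])
    return features[idx]


# Example: 1,500 frame-level features of dimension 512 reduced to a fixed
# budget of 128 samples, so downstream memory cost does not grow with T.
video_feats = torch.randn(1500, 512)
sampled = stratified_sample(video_feats, num_sections=128)
print(sampled.shape)  # torch.Size([128, 512])
```

Because the number of sampled features is fixed, the cost of subsequent self-attention is bounded independently of the video length, while the stochastic within-section choice preserves coverage of the whole video.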