Due to the large memory footprint of untrimmed videos, current state-of-the-art video localization methods operate atop precomputed video clip features. These features are extracted from video encoders typically trained for trimmed action classification, so they are not necessarily suitable for temporal localization. In this work, we propose a novel supervised pretraining paradigm for clip features that not only trains to classify activities but also considers background clips and global video information to improve temporal sensitivity. Extensive experiments show that features trained with our novel pretraining strategy significantly improve the performance of recent state-of-the-art methods on three tasks: Temporal Action Localization, Action Proposal Generation, and Dense Video Captioning. We also show that our pretraining approach is effective across three encoder architectures and two pretraining datasets. We believe video feature encoding is an important building block for localization algorithms, and extracting temporally-sensitive features should be of paramount importance in building more accurate models. The code and pretrained models are available on our project website.