Many video analysis tasks require temporal localization, and hence the detection of content changes. However, most existing models developed for these tasks are pre-trained on general video action classification tasks. This is because large-scale annotation of temporal boundaries in untrimmed videos is expensive. Therefore, no suitable datasets exist for temporal boundary-sensitive pre-training. In this paper, for the first time, we investigate model pre-training for temporal localization by introducing a novel boundary-sensitive pretext (BSP) task. Instead of relying on costly manual annotations of temporal boundaries, we propose to synthesize temporal boundaries in existing video action classification datasets. With the synthesized boundaries, BSP can be carried out simply by classifying the boundary types. This enables the learning of video representations that are far more transferable to downstream temporal localization tasks. Extensive experiments show that the proposed BSP is superior and complementary to the existing action-classification-based pre-training counterpart, and achieves new state-of-the-art performance on several temporal localization tasks.
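To make the boundary-synthesis idea concrete, the sketch below shows one plausible way to manufacture labeled boundaries from trimmed action clips. It is a minimal illustration only: the toy taxonomy of boundary types (splicing two clips from different action classes vs. introducing a playback-speed change), the function names, and the 16-frame window length are all assumptions for this example, not the paper's actual design.

```python
import numpy as np

# Hypothetical boundary-type labels for illustration; the paper's actual
# taxonomy of boundary classes may differ.
BOUNDARY_TYPES = {"different_action": 0, "speed_change": 1}

def splice_boundary(clip_a, clip_b, clip_len=16, rng=None):
    """Splice two trimmed action clips (arrays of shape (T, H, W, C)) at a
    random cut point, yielding a window that contains a synthetic boundary
    between two different actions. Returns (window, label, cut_index)."""
    rng = rng or np.random.default_rng()
    cut = int(rng.integers(1, clip_len))          # boundary position in window
    head = clip_a[:cut]                           # frames before the boundary
    tail = clip_b[:clip_len - cut]                # frames after the boundary
    window = np.concatenate([head, tail], axis=0)
    return window, BOUNDARY_TYPES["different_action"], cut

def speed_change_boundary(clip, clip_len=16, rng=None):
    """Create a boundary inside a single clip by switching playback speed
    mid-window: 1x sampling before the cut, 2x sub-sampling after it."""
    rng = rng or np.random.default_rng()
    cut = int(rng.integers(1, clip_len))
    head = clip[:cut]                             # 1x speed segment
    tail = clip[cut:cut + 2 * (clip_len - cut):2] # every other frame = 2x speed
    window = np.concatenate([head, tail], axis=0)
    return window, BOUNDARY_TYPES["speed_change"], cut
```

Under this reading of the abstract, the pretext task would then train a video backbone to predict the boundary type of each synthesized window, so that the learned features become sensitive to content changes before being transferred to downstream temporal localization tasks.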