TSP: Temporally-Sensitive Pretraining of Video Encoders for Localization Tasks

Understanding videos is challenging in computer vision. In particular, the large memory footprint of an untrimmed video makes most tasks infeasible to train end-to-end without dropping part of the input data. To cope with the memory limitation of commodity GPUs, current video localization models encode videos in an offline fashion. Even though these encoders are learned, they are typically trained for action classification tasks at the frame- or clip-level. Since it is difficult to finetune these encoders for other video tasks, they might be sub-optimal for temporal localization tasks. In this work, we propose a novel, supervised pretraining paradigm for clip-level video representation that does not only train to classify activities, but also considers background clips and global video information to gain temporal sensitivity. Extensive experiments show that features extracted by clip-level encoders trained with our novel pretraining task are more discriminative for several temporal localization tasks. Specifically, we show that using our newly trained features with state-of-the-art methods significantly improves performance on three tasks: Temporal Action Localization (+1.72% in average mAP on ActivityNet and +4.4% in mAP@0.5 on THUMOS14), Action Proposal Generation (+1.94% in AUC on ActivityNet), and Dense Video Captioning (+0.31% in average METEOR on ActivityNet Captions). We believe video feature encoding is an important building block for many video algorithms, and extracting meaningful features should be of paramount importance in the effort to build more accurate models.
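
To make the pretraining objective described above more concrete, here is a minimal sketch of how a per-clip loss could combine activity classification with background awareness and global video context. Everything in it is an assumption made for illustration, not the authors' implementation: the function names (`global_video_feature`, `clip_loss`), the use of max-pooling over clips as the global video feature, the two plain linear heads, and the unweighted sum of the two losses are all hypothetical. It uses only the Python standard library.

```python
import math


def cross_entropy(logits, target):
    """Softmax cross-entropy for one example, computed with log-sum-exp."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]


def linear(weight, bias, x):
    """Apply a dense layer: one output per row of `weight`."""
    return [sum(w * v for w, v in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def global_video_feature(clip_features):
    """Assumed global context: max-pool each dimension over all clips."""
    return [max(column) for column in zip(*clip_features)]


def clip_loss(local, global_feat, action_head, region_head,
              action_label, is_foreground):
    """Hypothetical two-head loss for a single clip.

    The region head sees the local clip feature concatenated with the
    global video feature and predicts background (0) vs. foreground (1).
    The action head sees only the local feature and is trained on
    foreground clips only, since background clips carry no activity label.
    """
    region_logits = linear(*region_head, local + global_feat)
    loss = cross_entropy(region_logits, 1 if is_foreground else 0)
    if is_foreground:
        action_logits = linear(*action_head, local)
        loss += cross_entropy(action_logits, action_label)
    return loss


# Toy data: three 2-dimensional clip features from one untrimmed video.
clips = [[1.0, 0.0], [0.0, 2.0], [0.5, 0.5]]
gvf = global_video_feature(clips)

# Untrained (all-zero) heads: 2 action classes, 2 region classes.
action_head = ([[0.0, 0.0], [0.0, 0.0]], [0.0, 0.0])
region_head = ([[0.0] * 4, [0.0] * 4], [0.0, 0.0])

foreground_loss = clip_loss(clips[0], gvf, action_head, region_head,
                            action_label=1, is_foreground=True)
background_loss = clip_loss(clips[2], gvf, action_head, region_head,
                            action_label=None, is_foreground=False)
```

In this sketch, background clips contribute only to the region head, so the encoder receives a training signal from parts of the video that a pure action classifier would discard. A real setup would replace the toy features with the output of a clip-level video encoder and backpropagate through it.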

