Temporal Activity Detection aims to predict activity classes per frame, in contrast to video-level predictions as done in Activity Classification (i.e., Activity Recognition). Due to the expensive frame-level annotations required for detection, the scale of detection datasets is limited. Thus, commonly, previous work on temporal activity detection resorts to fine-tuning a classification model pretrained on large-scale classification datasets (e.g., Kinetics-400). However, such pretrained models are not ideal for downstream detection performance due to the disparity between the pretraining and the downstream fine-tuning tasks. This work proposes a novel self-supervised pretraining method for detection leveraging classification labels to mitigate such disparity by introducing frame-level pseudo labels, multi-action frames, and action segments. We show that the models pretrained with the proposed self-supervised detection task outperform prior work on multiple challenging activity detection benchmarks, including Charades and MultiTHUMOS. Our extensive ablations further provide insights on when and how to use the proposed models for activity detection. Code and models will be released online.