Pre-training video transformers on extra large-scale datasets is generally required to achieve premier performance on relatively small datasets. In this paper, we show that video masked autoencoders (VideoMAE) are data-efficient learners for self-supervised video pre-training (SSVP). We are inspired by the recent ImageMAE and propose customized video tube masking with an extremely high ratio. This simple design makes video reconstruction a more challenging self-supervision task, thus encouraging the extraction of more effective video representations during pre-training. We obtain three important findings on SSVP: (1) An extremely high masking ratio (i.e., 90% to 95%) still yields favorable VideoMAE performance. The temporally redundant video content enables a higher masking ratio than that of images. (2) VideoMAE achieves impressive results on very small datasets (i.e., around 3k-4k videos) without using any extra data. (3) VideoMAE shows that data quality is more important than data quantity for SSVP. Domain shift between pre-training and target datasets is an important issue in SSVP. Notably, our VideoMAE with the vanilla ViT can achieve 85.8% on Kinetics-400, 75.3% on Something-Something V2, 90.8% on UCF101, and 61.1% on HMDB51, without using any extra data. Code is available at https://github.com/MCG-NJU/VideoMAE.
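To make the tube masking idea concrete, below is a minimal sketch of how a tube mask with an extremely high ratio might be generated. It assumes a clip tokenized into `num_frames` temporal slices with `num_patches_per_frame` spatial patch positions each; the names `tube_mask`, `num_frames`, `num_patches_per_frame`, and `mask_ratio` are illustrative, not the repository's actual API. The key property is that one random spatial pattern is shared across all frames, so temporally redundant patches cannot leak the masked content.

```python
import numpy as np

def tube_mask(num_frames: int, num_patches_per_frame: int,
              mask_ratio: float = 0.9) -> np.ndarray:
    """Return a boolean mask of shape (num_frames, num_patches_per_frame),
    where True marks a masked token. The same spatial pattern is repeated
    along the temporal axis, forming "tubes". Illustrative sketch only."""
    num_masked = int(mask_ratio * num_patches_per_frame)
    # Randomly choose which spatial positions to mask for this sample.
    frame_mask = np.zeros(num_patches_per_frame, dtype=bool)
    masked_idx = np.random.choice(num_patches_per_frame, num_masked,
                                  replace=False)
    frame_mask[masked_idx] = True
    # Broadcast the per-frame pattern across time: every frame masks the
    # same positions, yielding a temporally consistent, extremely high ratio.
    return np.tile(frame_mask, (num_frames, 1))

# Example: a 16-frame clip with a 14x14 patch grid and a 90% masking ratio.
mask = tube_mask(num_frames=16, num_patches_per_frame=14 * 14, mask_ratio=0.9)
print(mask.shape, mask.mean())  # (16, 196), ~0.9
```

Sharing the mask across time is what makes the reconstruction task hard: with an independent per-frame mask, the model could often copy a visible patch from a neighboring frame rather than learn a useful video representation.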