Benefiting from masked visual modeling, self-supervised video representation learning has achieved remarkable progress. However, existing methods focus on learning representations from scratch by reconstructing low-level features such as raw pixel RGB values. In this paper, we propose masked video distillation (MVD), a simple yet effective two-stage masked feature modeling framework for video representation learning: first, we pretrain an image (or video) model by recovering low-level features of masked patches; then, we use the resulting features as targets for masked feature modeling. For the choice of teacher models, we observe that students taught by video teachers perform better on temporally-heavy video tasks, while image teachers transfer stronger spatial representations for spatially-heavy video tasks. Visualization analysis also indicates that different teachers produce different learned patterns for students. Motivated by this observation, we design a spatial-temporal co-teaching method for MVD to leverage the advantages of different teachers. Specifically, we distill student models from both video teachers and image teachers via masked feature modeling. Extensive experimental results demonstrate that video transformers pretrained with spatial-temporal co-teaching outperform models distilled with a single teacher on a multitude of video datasets. Our MVD with vanilla ViT achieves state-of-the-art performance compared with previous supervised or self-supervised methods on several challenging video downstream tasks. For example, with the ViT-Large model, our MVD achieves 86.4% and 75.9% Top-1 accuracy on Kinetics-400 and Something-Something-v2, outperforming VideoMAE by 1.2% and 1.6%, respectively. Code will be available at \url{https://github.com/ruiwang2021/mvd}.
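To make the spatial-temporal co-teaching idea concrete, the following is a minimal PyTorch sketch of a distillation objective in the spirit described above: the student reconstructs, for masked patches only, the features of both a frozen image teacher and a frozen video teacher. All module names, tensor shapes, the Smooth-L1 criterion, and the equal loss weighting are illustrative assumptions, not the paper's exact implementation.

```python
# Hedged sketch of spatial-temporal co-teaching via masked feature modeling.
# Names (image_head, video_head, co_teaching_loss) are hypothetical.
import torch
import torch.nn as nn


def co_teaching_loss(student_tokens, image_target, video_target,
                     image_head, video_head, mask):
    """Regress masked student tokens onto two teachers' features.

    student_tokens: (B, N, C) tokens from the student video transformer
    image_target:   (B, N, C) per-patch features from a frozen image teacher
    video_target:   (B, N, C) per-patch features from a frozen video teacher
    image_head / video_head: lightweight prediction heads (assumption)
    mask:           (B, N) boolean, True where the patch was masked
    """
    pred_img = image_head(student_tokens)   # predict image-teacher features
    pred_vid = video_head(student_tokens)   # predict video-teacher features
    loss_fn = nn.SmoothL1Loss()
    loss_img = loss_fn(pred_img[mask], image_target[mask])
    loss_vid = loss_fn(pred_vid[mask], video_target[mask])
    return loss_img + loss_vid              # equal weighting is an assumption


if __name__ == "__main__":
    B, N, C = 2, 196, 768
    student_tokens = torch.randn(B, N, C)
    image_target = torch.randn(B, N, C)     # would come from a frozen image teacher
    video_target = torch.randn(B, N, C)     # would come from a frozen video teacher
    mask = torch.rand(B, N) < 0.9           # high masking ratio, as in masked modeling
    image_head, video_head = nn.Linear(C, C), nn.Linear(C, C)
    print(co_teaching_loss(student_tokens, image_target, video_target,
                           image_head, video_head, mask))
```

In a real pretraining loop, the teacher features would be produced by models that were themselves pretrained with low-level masked reconstruction (stage one), and only the student and the two prediction heads would receive gradients.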