Co-training Transformer with Videos and Images Improves Action Recognition

In action recognition, models are typically pre-trained on object recognition with images, such as ImageNet, and later fine-tuned on the target action recognition task with videos. This approach has achieved good empirical performance, especially with recent transformer-based video architectures. While many recent works aim to design more advanced transformer architectures for action recognition, less effort has gone into how to train video transformers. In this work, we explore several training paradigms and present two findings. First, video transformers benefit from joint training on diverse video datasets and label spaces (e.g., Kinetics is appearance-focused while SomethingSomething is motion-focused). Second, by further co-training with images (treated as single-frame videos), the video transformers learn even better video representations. We term this approach Co-training Videos and Images for Action Recognition (CoVeR). In particular, when pre-trained on ImageNet-21K with the TimeSFormer architecture, CoVeR improves Kinetics-400 Top-1 accuracy by 2.4%, Kinetics-600 by 2.3%, and SomethingSomething-v2 by 2.3%. When pre-trained on larger-scale image datasets following the previous state of the art, CoVeR achieves the best results on Kinetics-400 (87.2%), Kinetics-600 (87.9%), Kinetics-700 (79.8%), SomethingSomething-v2 (70.9%), and Moments-in-Time (46.1%), with a simple spatio-temporal video transformer.
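The two findings above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `image_as_video` and `cotraining_schedule`, the size-proportional sampling rule, and the toy dataset sizes are all assumptions introduced here for illustration. The sketch shows only the data-side idea, namely that an image enters the pipeline as a video with one frame, and that each training step draws its batch from one of several datasets, each of which would have its own label space and classification head.

```python
import random


def image_as_video(image):
    """Wrap a single image as a video that contains exactly one frame."""
    return [image]


def cotraining_schedule(dataset_sizes, num_steps, seed=0):
    """Return the name of the dataset to draw a batch from at each step.

    Assumption: datasets are sampled with probability proportional to
    their size. The abstract does not specify the sampling rule.
    """
    rng = random.Random(seed)
    names = list(dataset_sizes)
    weights = [dataset_sizes[name] for name in names]
    return [rng.choices(names, weights=weights, k=1)[0] for _ in range(num_steps)]


# Hypothetical dataset names and toy sizes, not real dataset statistics.
sizes = {"appearance_videos": 300, "motion_videos": 200, "images": 500}
schedule = cotraining_schedule(sizes, num_steps=1000)

# Each step would route the batch through a shared backbone and then
# through the classification head that belongs to the sampled dataset.
steps_per_dataset = {name: schedule.count(name) for name in sizes}
single_frame_clip = image_as_video("placeholder_image")
```

In a real training loop, the entries of `schedule` would select both the data loader and the output head for that step, so the shared backbone receives gradients from every dataset while each label space stays separate.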
