Recent vision-transformer-based video models mostly follow the ``image pre-training then finetuning'' paradigm and have achieved great success on multiple video benchmarks. However, fully finetuning such a video model can be computationally expensive and unnecessary, given that pre-trained image transformer models have already demonstrated exceptional transferability. In this work, we propose a novel method to Adapt pre-trained Image Models (AIM) for efficient video understanding. By freezing the pre-trained image model and adding a few lightweight Adapters, we introduce spatial adaptation, temporal adaptation, and joint adaptation to gradually equip an image model with spatiotemporal reasoning capability. We show that our proposed AIM achieves competitive or even better performance than prior art with substantially fewer tunable parameters on four video action recognition benchmarks. Thanks to its simplicity, our method is also generally applicable to different image pre-trained models and has the potential to leverage more powerful image foundation models in the future. The project webpage is \url{https://adapt-image-models.github.io/}.
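To make the adapter idea concrete, below is a minimal PyTorch sketch of a bottleneck Adapter attached to a frozen transformer layer, assuming the common down-projection, nonlinearity, up-projection design with a residual connection; the module name, bottleneck ratio, and the stand-in backbone layer are illustrative assumptions rather than the paper's exact implementation.

\begin{verbatim}
# A minimal sketch of a bottleneck Adapter; the design below
# (down-project -> GELU -> up-project, plus residual) is a common
# adapter pattern and an assumption, not the paper's exact code.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim: int, bottleneck_ratio: float = 0.25):
        super().__init__()
        hidden = int(dim * bottleneck_ratio)
        self.down = nn.Linear(dim, hidden)  # project to a narrow bottleneck
        self.act = nn.GELU()
        self.up = nn.Linear(hidden, dim)    # project back to model width

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # The residual path keeps the frozen backbone's features
        # unchanged when the adapter is near-zero at initialization.
        return x + self.up(self.act(self.down(x)))

# Only the adapters are trained; the image backbone stays frozen.
# The backbone layer here is a generic stand-in, not the actual model.
backbone = nn.TransformerEncoderLayer(d_model=768, nhead=12,
                                      batch_first=True)
for p in backbone.parameters():
    p.requires_grad = False
adapter = Adapter(dim=768)  # only these parameters receive gradients
\end{verbatim}

Because only the adapter weights receive gradients, the number of tunable parameters scales with the bottleneck width rather than with the size of the frozen backbone, which is what keeps this form of adaptation lightweight.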