ST-Adapter: Parameter-Efficient Image-to-Video Transfer Learning for Action Recognition

Capitalizing on large pre-trained models for various downstream tasks of interest has recently emerged as a strategy with promising performance. Due to ever-growing model sizes, the standard task adaptation strategy based on full fine-tuning becomes prohibitively costly in terms of model training and storage. This has led to a new research direction in parameter-efficient transfer learning. However, existing attempts typically focus on downstream tasks in the same modality as the pre-trained model (e.g., image understanding). This is limiting because in some modalities (e.g., video understanding) a comparably strong pre-trained model with sufficient knowledge is scarce or unavailable. In this work, we investigate a novel cross-modality transfer learning setting, namely parameter-efficient image-to-video transfer learning. To solve this problem, we propose a new Spatio-Temporal Adapter (ST-Adapter) for parameter-efficient fine-tuning per video task. With a built-in spatio-temporal reasoning capability in a compact design, ST-Adapter enables a pre-trained image model without temporal knowledge to reason about dynamic video content at a small (~8%) per-task parameter cost, requiring approximately 20 times fewer updated parameters than previous work. Extensive experiments on video action recognition tasks show that ST-Adapter can match or even outperform the strong full fine-tuning strategy and state-of-the-art video models, while retaining the advantage of parameter efficiency.
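To give a rough feel for why a compact adapter keeps the per-task parameter cost small, the following back-of-the-envelope sketch counts trainable parameters. The abstract does not spell out the adapter's internals, so everything here is an assumption made for illustration: the bottleneck layout (linear down-projection, depthwise temporal convolution, linear up-projection), the helper names `adapter_params`, `transformer_block_params`, and `trainable_fraction`, and the sizes used in the usage lines (width 768, bottleneck 384, depth 12).

```python
def adapter_params(width, bottleneck, temporal_kernel=3):
    """Parameter count of one hypothetical bottleneck adapter (assumed layout)."""
    down = width * bottleneck + bottleneck            # linear down-projection + bias
    conv = bottleneck * temporal_kernel + bottleneck  # depthwise temporal conv: one kernel per channel + bias
    up = bottleneck * width + width                   # linear up-projection + bias
    return down + conv + up


def transformer_block_params(width, mlp_ratio=4):
    """Approximate parameter count of one standard Transformer encoder block."""
    attention = 4 * (width * width + width)           # Q, K, V and output projections, each with bias
    hidden = mlp_ratio * width
    mlp = (width * hidden + hidden) + (hidden * width + width)
    layer_norms = 2 * (2 * width)                     # two LayerNorms, each with scale and shift
    return attention + mlp + layer_norms


def trainable_fraction(width, bottleneck, depth, adapters_per_block=1):
    """Ratio of adapter parameters (trained) to frozen backbone block parameters."""
    adapters = depth * adapters_per_block * adapter_params(width, bottleneck)
    backbone = depth * transformer_block_params(width)
    return adapters / backbone


if __name__ == "__main__":
    # Illustrative sizes only; they are not taken from the abstract.
    per_adapter = adapter_params(width=768, bottleneck=384)
    fraction = trainable_fraction(width=768, bottleneck=384, depth=12)
    print(f"parameters per adapter: {per_adapter}")
    print(f"adapter / backbone ratio: {fraction:.4f}")
```

The point of the sketch is only the scaling: the adapter costs roughly `2 * width * bottleneck` parameters per block, while a Transformer block costs roughly `12 * width**2`, so the trainable share stays a small fraction of the frozen backbone. Any resemblance between the ratio printed here and the reported ~8% depends entirely on the assumed sizes and is not a reproduction of the paper's accounting.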
