Adapting large pre-trained models to a variety of downstream tasks has recently shown promising performance. As model sizes keep growing, the standard task adaptation strategy of full fine-tuning becomes prohibitively costly in terms of model training and storage. This has led to a new research direction in parameter-efficient transfer learning. However, existing attempts typically focus on downstream tasks from the same modality as the pre-trained model (e.g., image understanding). This is limiting because for some modalities (e.g., video understanding), such a strong pre-trained model with sufficient knowledge is less available or not available at all. In this work, we investigate this novel cross-modality transfer learning setting, namely parameter-efficient image-to-video transfer learning. To solve this problem, we propose a new Spatio-Temporal Adapter (ST-Adapter) for parameter-efficient fine-tuning per video task. With a built-in spatio-temporal reasoning capability in a compact design, ST-Adapter enables a pre-trained image model without temporal knowledge to reason about dynamic video content at a small (~8%) per-task parameter cost, requiring approximately 20 times fewer updated parameters than previous work. Extensive experiments on video action recognition tasks show that our ST-Adapter can match or even outperform the strong full fine-tuning strategy and state-of-the-art video models, while retaining the advantage of parameter efficiency. The code and model are available at https://github.com/linziyi96/st-adapter