State-of-the-art video-text retrieval (VTR) methods typically fully fine-tune a pre-trained model (e.g., CLIP) on specific datasets, which can incur substantial storage costs in practical applications since a separate model must be stored for each task. To overcome this issue, we present the first work on parameter-efficient VTR from a pre-trained model, i.e., only a small number of parameters are tunable while the backbone is kept frozen. Towards this goal, we propose a new method dubbed Multimodal Video Adapter (MV-Adapter) for efficiently transferring the knowledge in pre-trained CLIP from image-text to video-text. Specifically, MV-Adapter adopts bottleneck structures in both the video and text branches and introduces two novel components. The first is a Temporal Adaptation Module in the video branch that injects global and local temporal contexts; we also learn weight calibrations to adapt to dynamic variations across frames. The second is a Cross-Modal Interaction Module that generates weights for the video/text branches through a shared parameter space, for better alignment between modalities. Thanks to the above innovations, MV-Adapter achieves on-par or better performance than standard fine-tuning with negligible parameter overhead. Notably, on five widely used VTR benchmarks (MSR-VTT, MSVD, LSMDC, DiDeMo, and ActivityNet), MV-Adapter consistently outperforms competing methods on both V2T and T2V tasks by large margins. Code will be released.
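To make the bottleneck-adapter idea concrete, below is a minimal PyTorch sketch of a residual bottleneck adapter for per-frame CLIP features that mixes in global (mean over frames) and local (depth-wise temporal convolution) temporal context. This is only an illustration under assumed shapes and hyperparameters, not the authors' MV-Adapter implementation; the class name, bottleneck width, and kernel size are hypothetical, and the weight-calibration and cross-modal weight-sharing components are omitted.

```python
import torch
import torch.nn as nn


class BottleneckVideoAdapter(nn.Module):
    """Illustrative bottleneck adapter: down-project, add temporal context, up-project.

    Input:  per-frame features of shape (batch, num_frames, dim) from a frozen backbone.
    Output: same shape, combined residually so only adapter parameters are trained.
    """

    def __init__(self, dim: int = 512, bottleneck: int = 64, kernel_size: int = 3):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)            # down-projection
        self.act = nn.GELU()
        # Local temporal context: depth-wise 1D convolution over the frame axis.
        self.local_temporal = nn.Conv1d(
            bottleneck, bottleneck, kernel_size,
            padding=kernel_size // 2, groups=bottleneck,
        )
        self.up = nn.Linear(bottleneck, dim)              # up-projection
        self.scale = nn.Parameter(torch.tensor(0.1))      # small scale keeps backbone behavior at init

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.act(self.down(x))                        # (B, T, bottleneck)
        global_ctx = h.mean(dim=1, keepdim=True)          # global temporal context, broadcast over frames
        local_ctx = self.local_temporal(h.transpose(1, 2)).transpose(1, 2)  # local temporal context
        h = h + global_ctx + local_ctx
        return x + self.scale * self.up(h)                # residual connection around the adapter


if __name__ == "__main__":
    frames = torch.randn(2, 12, 512)                      # 2 clips, 12 frames, feature dim 512
    adapter = BottleneckVideoAdapter()
    print(adapter(frames).shape)                          # torch.Size([2, 12, 512])
```

Because only the adapter's down/up projections, temporal convolution, and scale are trainable, the per-task storage cost is a small fraction of the frozen CLIP backbone, which is the parameter-efficiency argument the abstract makes.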