Many recent studies leverage pre-trained CLIP for text-video cross-modal retrieval by tuning the backbone with additional heavy modules, which not only brings a huge computational burden with many more parameters, but also causes forgetting of the knowledge learned upstream. In this work, we propose VoP: Text-Video Co-operative Prompt Tuning for efficient tuning on the text-video retrieval task. The proposed VoP is an end-to-end framework that introduces both video and text prompts, and can be regarded as a powerful baseline with only 0.1% trainable parameters. Further, based on the spatio-temporal characteristics of videos, we develop three novel video prompt mechanisms that improve performance at different scales of trainable parameters. The basic idea of these VoP enhancements is to model the frame position, frame context, and layer function with specific trainable prompts, respectively. Extensive experiments show that, compared to full fine-tuning, the enhanced VoP achieves a 1.4% average R@1 gain across five text-video retrieval benchmarks with 6× less parameter overhead. The code will be available at https://github.com/bighuang624/VoP.
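The ≈0.1%-trainable-parameters figure follows from simple counting: only the inserted prompt tokens are updated while the CLIP backbone stays frozen. A minimal back-of-the-envelope sketch, assuming a ViT-B/32-style CLIP (12-layer text tower of width 512, 12-layer vision tower of width 768, ~151M backbone parameters) and a hypothetical prompt length of 8 tokens per layer; these numbers are illustrative assumptions, not values from the paper:

```python
# Illustrative parameter count for deep prompt tuning on a frozen CLIP backbone.
# Prompt length and tower dimensions are assumptions for the sketch.

def prompt_params(num_layers: int, prompt_len: int, width: int) -> int:
    """Trainable parameters for learnable prompt tokens inserted at every layer."""
    return num_layers * prompt_len * width

# Assumed ViT-B/32-style CLIP: ~151M frozen backbone parameters.
backbone_params = 151_000_000

# Prompts in both towers: 12-layer text encoder (width 512) and
# 12-layer vision encoder (width 768), 8 prompt tokens each.
trainable = prompt_params(12, 8, 512) + prompt_params(12, 8, 768)

ratio = trainable / backbone_params
print(f"trainable prompts: {trainable:,} params ({100 * ratio:.2f}% of backbone)")
# On the order of 0.1% of the backbone, consistent with the abstract's claim.
```

Because the backbone gradients are never computed, the optimizer state and checkpoint deltas shrink by the same factor, which is where the parameter-overhead savings over full fine-tuning come from.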