Many recent studies adapt the pre-trained CLIP model to text-video cross-modal retrieval by tuning the backbone together with additional heavy modules, which not only adds a large computational burden and many more parameters, but also causes knowledge learned by the upstream model to be forgotten. In this work, we propose VoP: Text-Video Co-operative Prompt Tuning, for efficient tuning on the text-video retrieval task. The proposed VoP is an end-to-end framework that introduces both video and text prompts, and it can be regarded as a powerful baseline with only 0.1% trainable parameters. Further, based on the spatio-temporal characteristics of videos, we develop three novel video prompt mechanisms that improve performance at different scales of trainable parameters. The basic idea of the VoP enhancement is to model the frame position, frame context, and layer function with specific trainable prompts, respectively. Extensive experiments show that, compared to full fine-tuning, the enhanced VoP achieves a 1.4% average R@1 gain across five text-video retrieval benchmarks with 6x less parameter overhead. The code will be available at https://github.com/bighuang624/VoP.
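To give a feel for why prompt tuning keeps the trainable footprint so small, the following is a minimal back-of-the-envelope sketch, not the VoP implementation. It assumes that a fixed number of prompt vectors is learned for every layer of each frozen encoder. The layer counts, hidden widths, prompt length, and backbone size below are illustrative assumptions for a CLIP ViT-B/32-like model and are not taken from this abstract.

```python
from dataclasses import dataclass


@dataclass
class EncoderSpec:
    """Shape of a frozen Transformer encoder (hypothetical values, see below)."""
    num_layers: int
    width: int


def prompt_param_count(spec: EncoderSpec, prompt_len: int) -> int:
    """Trainable parameters when `prompt_len` prompt vectors of size
    `spec.width` are learned independently for every layer."""
    return spec.num_layers * prompt_len * spec.width


# All numbers below are assumptions chosen for illustration only.
VIDEO_ENCODER = EncoderSpec(num_layers=12, width=768)  # assumed visual branch
TEXT_ENCODER = EncoderSpec(num_layers=12, width=512)   # assumed text branch
BACKBONE_PARAMS = 151_000_000                          # rough, assumed backbone size
PROMPT_LEN = 8                                         # assumed prompts per layer

trainable = (
    prompt_param_count(VIDEO_ENCODER, PROMPT_LEN)
    + prompt_param_count(TEXT_ENCODER, PROMPT_LEN)
)
fraction = trainable / BACKBONE_PARAMS

print(f"trainable prompt parameters: {trainable}")
print(f"fraction of backbone: {fraction:.4%}")
```

Under these assumed sizes, the prompts account for well under 0.1% of the backbone's parameters, which is the same order of magnitude as the figure quoted above. The actual count depends on the prompt length and on any extra modules used by the enhanced video prompt mechanisms.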