Fine-tuned CLIP Models are Efficient Video Learners

Large-scale multi-modal training with image-text pairs imparts strong generalization to the CLIP model. Since training on a similar scale for videos is infeasible, recent approaches focus on effectively transferring image-based CLIP to the video domain. To this end, they add new parametric modules to learn temporal information and inter-frame relationships, which require meticulous design effort. Furthermore, when the resulting models are trained on videos, they tend to overfit to the given task distribution and generalize poorly. This raises the following question: how can image-level CLIP representations be transferred effectively to videos? In this work, we show that a simple Video Fine-tuned CLIP (ViFi-CLIP) baseline is generally sufficient to bridge the domain gap from images to videos. Our qualitative analysis illustrates that frame-level processing by the CLIP image encoder, followed by feature pooling and similarity matching with the corresponding text embeddings, helps to implicitly model temporal cues within ViFi-CLIP. Such fine-tuning helps the model focus on scene dynamics, moving objects, and inter-object relationships. For low-data regimes where full fine-tuning is not viable, we propose a "bridge and prompt" approach that first uses fine-tuning to bridge the domain gap and then learns prompts on the language and vision sides to adapt CLIP representations. We extensively evaluate this simple yet strong baseline in zero-shot, base-to-novel generalization, few-shot, and fully supervised settings across five video benchmarks. Our code is available at https://github.com/muzairkhattak/ViFi-CLIP.

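The pipeline the abstract describes (per-frame encoding, feature pooling, then similarity matching against text embeddings) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names are hypothetical, the toy vectors stand in for outputs of the CLIP image and text encoders, and the choice of average pooling and cosine similarity is an assumption, since the abstract only says "feature pooling" and "similarity matching".

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def pool_frames(frame_embeddings):
    """Average per-frame embeddings into one video-level embedding.

    Assumption: plain temporal averaging; the abstract does not name the pooling.
    """
    num_frames = len(frame_embeddings)
    dim = len(frame_embeddings[0])
    return [sum(frame[d] for frame in frame_embeddings) / num_frames for d in range(dim)]


def video_text_similarity(frame_embeddings, text_embeddings):
    """Return one cosine similarity per candidate text embedding."""
    video = l2_normalize(pool_frames(frame_embeddings))
    return [
        sum(v * t for v, t in zip(video, l2_normalize(text)))
        for text in text_embeddings
    ]


# Toy stand-ins: in practice each frame vector would come from the CLIP image
# encoder and each text vector from the CLIP text encoder (hypothetical data).
frames = [[1.0, 0.0], [1.0, 0.2]]   # two frames, 2-dimensional embeddings
texts = [[1.0, 0.0], [0.0, 1.0]]    # two candidate class prompts
scores = video_text_similarity(frames, texts)
best_class = max(range(len(scores)), key=scores.__getitem__)
```

In this sketch the predicted class is the prompt whose embedding has the highest cosine similarity with the pooled video embedding. Because the pooling step has no learnable parameters, any temporal sensitivity would have to come from fine-tuning the encoders themselves, which is consistent with the abstract's claim that temporal cues are modeled implicitly.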