Image-text pretrained models such as CLIP have shown impressive general multi-modal knowledge learned from large-scale image-text data pairs, and have therefore attracted increasing attention for their potential to improve visual representation learning in the video domain. In this paper, building on the CLIP model, we revisit temporal modeling in the context of image-to-video knowledge transfer, which is the key point for extending image-text pretrained models to the video domain. We find that current temporal modeling mechanisms are tailored to either high-level semantic-dominant tasks (e.g., retrieval) or low-level visual pattern-dominant tasks (e.g., recognition), and fail to work on both simultaneously. The key difficulty lies in modeling temporal dependency while taking advantage of both high-level and low-level knowledge in the CLIP model. To tackle this problem, we present the Spatial-Temporal Auxiliary Network (STAN), a simple and effective temporal modeling mechanism that extends the CLIP model to diverse video tasks. Specifically, to transfer both low-level and high-level knowledge, STAN adopts a branch structure with decomposed spatial-temporal modules that enable multi-level CLIP features to be spatial-temporally contextualized. We evaluate our method on two representative video tasks: Video-Text Retrieval and Video Recognition. Extensive experiments demonstrate the superiority of our model over state-of-the-art methods on various datasets, including MSR-VTT, DiDeMo, LSMDC, MSVD, Kinetics-400, and Something-Something-V2. Code will be available at https://github.com/farewellthree/STAN
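To make the idea of a branch with decomposed spatial-temporal modules more concrete, the following is a minimal NumPy sketch, not the released STAN implementation. It assumes that "decomposed" means spatial self-attention within each frame followed by temporal self-attention across frames, and that the branch consumes features from several CLIP layers in sequence. The function names (`self_attention`, `decomposed_st_block`, `branch_forward`), the tensor layout `(T, N, D)`, and the omission of learned projections and normalization are all illustrative assumptions.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(x):
    """Parameter-free scaled dot-product self-attention.

    x has shape (..., L, D); attention runs over the L axis.
    Learned query/key/value projections are omitted for brevity (assumption).
    """
    d = x.shape[-1]
    scores = x @ np.swapaxes(x, -1, -2) / np.sqrt(d)  # (..., L, L)
    return softmax(scores, axis=-1) @ x               # (..., L, D)


def decomposed_st_block(x):
    """One hypothetical spatial-then-temporal block.

    x: (T, N, D) = frames x patch tokens x channels.
    """
    # Spatial step: tokens attend to other tokens of the same frame.
    x = x + self_attention(x)
    # Temporal step: each spatial position attends across frames.
    xt = np.swapaxes(x, 0, 1)          # (N, T, D)
    xt = xt + self_attention(xt)
    return np.swapaxes(xt, 0, 1)       # back to (T, N, D)


def branch_forward(multi_level_feats):
    """Hypothetical side branch fed by features from several backbone layers.

    multi_level_feats: list of (T, N, D) arrays, ordered low level to high level.
    """
    h = np.zeros_like(multi_level_feats[0])
    for feat in multi_level_feats:
        # Inject the next level's features, then contextualize them.
        h = decomposed_st_block(h + feat)
    return h


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Stand-ins for per-layer CLIP features: 4 frames, 5 tokens, 8 channels.
    feats = [rng.standard_normal((4, 5, 8)) for _ in range(3)]
    print(branch_forward(feats).shape)
```

Factorizing attention this way keeps the cost at roughly O(T·N² + N·T²) score entries per block instead of O((T·N)²) for joint space-time attention, which is one common motivation for decomposed designs; whether STAN is motivated the same way is not stated in the abstract.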