Bidirectional Cross-Modal Knowledge Exploration for Video Recognition with Pre-trained Vision-Language Models

Vision-language models (VLMs) pre-trained on large-scale image-text pairs have demonstrated impressive transferability across a wide range of visual tasks. Transferring knowledge from such powerful pre-trained VLMs is emerging as a promising direction for building effective video recognition models. However, exploration in this direction is still limited. In our view, the main appeal of pre-trained vision-language models is the bridge they build between the visual and textual domains. In this paper, we present a novel framework called BIKE, which uses this cross-modal bridge to explore bidirectional knowledge: i) We propose a Video Attribute Association mechanism, which leverages Video-to-Text knowledge to generate auxiliary textual attributes that complement video recognition. ii) We also present a Temporal Concept Spotting mechanism, which uses Text-to-Video expertise to capture temporal saliency in a parameter-free manner, yielding an enhanced video representation. Extensive experiments on popular video datasets (i.e., Kinetics-400 & 600, UCF-101, HMDB-51, and ActivityNet) show that our method achieves state-of-the-art performance in most recognition scenarios, e.g., general, zero-shot, and few-shot video recognition. To the best of our knowledge, our best model achieves a state-of-the-art accuracy of 88.4% on the challenging Kinetics-400 dataset using the released CLIP pre-trained model.
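The two directions described above can be illustrated with a small sketch. It is not the BIKE implementation: the abstract does not specify how similarities are computed or normalized, so cosine similarity, a softmax over frames, the temperature value, and the function names `top_k_attributes` and `saliency_pooled_video` are all assumptions made here for illustration. The toy two-dimensional vectors stand in for embeddings that would come from a pre-trained vision-language model such as CLIP.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def softmax(scores, temperature=1.0):
    """Numerically stable softmax over a list of scores."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_attributes(video_emb, attribute_embs, k=2):
    """Video-to-Text direction (assumed form): return the k attribute
    phrases whose embeddings are closest to the video embedding."""
    ranked = sorted(
        attribute_embs,
        key=lambda name: cosine(video_emb, attribute_embs[name]),
        reverse=True,
    )
    return ranked[:k]


def saliency_pooled_video(frame_embs, text_emb, temperature=0.1):
    """Text-to-Video direction (assumed form): weight each frame by its
    similarity to the text embedding, then average. No learned parameters
    are involved; the temperature is a fixed, hypothetical constant."""
    weights = softmax([cosine(f, text_emb) for f in frame_embs], temperature)
    dim = len(frame_embs[0])
    pooled = [
        sum(w * f[d] for w, f in zip(weights, frame_embs)) for d in range(dim)
    ]
    return pooled, weights


# Toy data: hypothetical embeddings, not model outputs.
attributes = {
    "riding": [1.0, 0.0],
    "cooking": [0.0, 1.0],
    "cycling": [0.9, 0.1],
}
frames = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
text = [1.0, 0.0]

selected = top_k_attributes([1.0, 0.0], attributes, k=2)
pooled, weights = saliency_pooled_video(frames, text)
```

In this sketch, frames that align with the text receive larger weights, so the pooled representation emphasizes them without adding any trainable layer. The selected attribute phrases could then be joined into a sentence and encoded as auxiliary text, which is one plausible reading of how the attributes complement recognition.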
