Large-scale contrastive language-image pre-training such as CLIP has recently shown remarkable success on a wide range of downstream tasks, but it remains under-explored for the challenging few-shot action recognition (FSAR) task. In this work, we aim to transfer the powerful multimodal knowledge of CLIP to alleviate the inaccurate prototype estimation caused by data scarcity, a critical problem in low-shot regimes. To this end, we present a CLIP-guided prototype modulating framework called CLIP-FSAR, which consists of two key components: a video-text contrastive objective and prototype modulation. Specifically, the former bridges the task discrepancy between CLIP and the few-shot video task by contrasting videos with their corresponding class text descriptions. The latter leverages the transferable textual concepts from CLIP to adaptively refine visual prototypes with a temporal Transformer. In this way, CLIP-FSAR can take full advantage of the rich semantic priors in CLIP to obtain reliable prototypes and achieve accurate few-shot classification. Extensive experiments on five commonly used benchmarks demonstrate the effectiveness of our proposed method, and CLIP-FSAR significantly outperforms existing state-of-the-art methods under various settings. The source code and models will be publicly available at https://github.com/alibaba-mmai-research/CLIP-FSAR.
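To make the two components concrete, the following is a minimal, illustrative PyTorch sketch of what a video-text contrastive objective and a text-guided prototype modulation with a temporal Transformer could look like. All module names, shapes, and hyper-parameters (e.g., `PrototypeModulator`, a 512-d feature dimension, temperature 0.07) are assumptions for exposition, not the authors' reference implementation.

```python
# Minimal sketch of (1) a video-text contrastive loss and (2) prototype
# modulation that fuses CLIP text features with support frame features
# through a temporal Transformer. Shapes and names are illustrative only.
import torch
import torch.nn as nn
import torch.nn.functional as F


class PrototypeModulator(nn.Module):
    """Refines visual prototypes using CLIP text features of class names."""

    def __init__(self, dim: int = 512, heads: int = 8, layers: int = 1):
        super().__init__()
        enc_layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.temporal_transformer = nn.TransformerEncoder(enc_layer, num_layers=layers)

    def forward(self, frame_feats: torch.Tensor, text_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: [N_way, T, D] averaged support frame features per class
        # text_feats:  [N_way, D]    CLIP text embeddings of the class names
        tokens = torch.cat([text_feats.unsqueeze(1), frame_feats], dim=1)  # prepend text token
        fused = self.temporal_transformer(tokens)                          # [N_way, 1 + T, D]
        return fused[:, 1:].mean(dim=1)                                    # temporal pooling -> [N_way, D]


def video_text_contrastive_loss(video_feats, text_feats, labels, tau: float = 0.07):
    """Contrastive objective aligning video features with class text features."""
    v = F.normalize(video_feats, dim=-1)   # [B, D]
    t = F.normalize(text_feats, dim=-1)    # [C, D]
    logits = v @ t.t() / tau               # [B, C] video-to-text similarities
    return F.cross_entropy(logits, labels)


def few_shot_logits(query_feats, prototypes, tau: float = 0.07):
    """Cosine-similarity classification of queries against modulated prototypes."""
    q = F.normalize(query_feats, dim=-1)   # [Q, D]
    p = F.normalize(prototypes, dim=-1)    # [N_way, D]
    return q @ p.t() / tau                 # [Q, N_way]
```

In this sketch, the contrastive loss plays the role of bridging CLIP's image-text pre-training and the video domain, while the modulator injects textual semantics into the visual prototypes before nearest-prototype matching; the actual CLIP-FSAR design should be taken from the paper and the released code.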