Few-shot (FS) and zero-shot (ZS) learning are two different approaches for scaling temporal action detection (TAD) to new classes. The former adapts a pretrained vision model to a new task represented by as few as a single video per class, whilst the latter requires no training examples by exploiting a semantic description of the new class. In this work, we introduce a new multi-modality few-shot (MMFS) TAD problem, which can be considered a marriage of FS-TAD and ZS-TAD by leveraging few-shot support videos and new class names jointly. To tackle this problem, we further introduce a novel MUlti-modality PromPt mETa-learning (MUPPET) method. This is enabled by efficiently bridging pretrained vision and language models whilst maximally reusing already learned capacity. Concretely, we construct multi-modal prompts by mapping support videos into the textual token space of a vision-language model using a meta-learned, adapter-equipped visual semantics tokenizer. To tackle large intra-class variation, we further design a query feature regulation scheme. Extensive experiments on ActivityNetv1.3 and THUMOS14 demonstrate that our MUPPET outperforms state-of-the-art alternative methods, often by a large margin. We also show that our MUPPET can be easily extended to tackle the few-shot object detection problem and again achieves state-of-the-art performance on the MS-COCO dataset. The code will be available at https://github.com/sauradip/MUPPET
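To make the prompt-construction idea concrete, the following is a minimal sketch (not the authors' implementation) of how few-shot support-video features could be mapped into a vision-language model's textual token space via a lightweight residual adapter and concatenated with class-name token embeddings to form a multi-modal prompt. All module names, dimensions, and the aggregation scheme here are illustrative assumptions.

```python
# Minimal sketch, assuming a CLIP-like text token dimension of 512 and
# pooled support-video features of dimension 768; names are hypothetical.
import torch
import torch.nn as nn

class VisualSemanticsTokenizer(nn.Module):
    """Maps pooled support-video features to pseudo text tokens."""
    def __init__(self, vid_dim=768, txt_dim=512, n_tokens=4, adapter_dim=64):
        super().__init__()
        # Lightweight adapter: down-project, non-linearity, up-project,
        # so most of the pretrained capacity is reused untouched.
        self.adapter = nn.Sequential(
            nn.Linear(vid_dim, adapter_dim),
            nn.ReLU(),
            nn.Linear(adapter_dim, vid_dim),
        )
        # Project the adapted visual feature into n_tokens textual tokens.
        self.to_tokens = nn.Linear(vid_dim, n_tokens * txt_dim)
        self.n_tokens, self.txt_dim = n_tokens, txt_dim

    def forward(self, support_feats):  # support_feats: (K, vid_dim) for K shots
        x = support_feats + self.adapter(support_feats)  # residual adapter
        x = x.mean(dim=0)                                # aggregate the K-shot support set
        return self.to_tokens(x).view(self.n_tokens, self.txt_dim)

def build_multimodal_prompt(class_name_tokens, visual_tokens):
    """Concatenate class-name text embeddings with visual pseudo tokens."""
    return torch.cat([class_name_tokens, visual_tokens], dim=0)

# Usage: 5-shot support features plus a class-name embedding of 3 tokens.
tokenizer = VisualSemanticsTokenizer()
support = torch.randn(5, 768)
name_emb = torch.randn(3, 512)
prompt = build_multimodal_prompt(name_emb, tokenizer(support))
print(prompt.shape)  # torch.Size([7, 512])
```

In such a design, only the adapter and projection layers would be meta-learned, keeping the pretrained vision-language backbone frozen; the resulting prompt can then condition the text encoder for detecting the new class in query videos.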