In this paper, we consider the problem of temporal action localization under the low-shot (zero-shot & few-shot) scenario, with the goal of detecting and classifying action instances from arbitrary categories within untrimmed videos, including categories not seen at training time. We adopt a Transformer-based two-stage action localization architecture with class-agnostic action proposal, followed by open-vocabulary classification. We make the following contributions. First, to complement image-text foundation models with temporal motion information, we improve class-agnostic action proposal by explicitly aligning the embeddings of optical flow, RGB, and text, which has largely been ignored in existing low-shot methods. Second, to improve open-vocabulary action classification, we construct classifiers with strong discriminative power, i.e., avoiding lexical ambiguities. Specifically, we propose to prompt the pre-trained CLIP text encoder either with detailed action descriptions (acquired from large-scale language models) or with visually-conditioned, instance-specific prompt vectors. Third, we conduct thorough experiments and ablation studies on THUMOS14 and ActivityNet1.3, demonstrating the superior performance of our proposed model, which outperforms existing state-of-the-art approaches by a significant margin.
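The first contribution (aligning optical-flow, RGB, and text embeddings) can be illustrated with a minimal sketch: symmetric InfoNCE terms pull per-snippet RGB and flow embeddings toward the text embedding of the matching action, and toward each other. The loss form, dimensions, and function names below are illustrative assumptions rather than the paper's exact formulation.

```python
# Minimal sketch of tri-modal (RGB / optical flow / text) embedding alignment.
# All names and hyper-parameters here are illustrative assumptions.
import torch
import torch.nn.functional as F


def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE over a batch of paired embeddings of shape (N, D)."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                      # (N, N) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)    # matched pairs lie on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def trimodal_alignment_loss(rgb_emb, flow_emb, text_emb):
    """Align RGB, optical-flow, and text embeddings pairwise."""
    return (info_nce(rgb_emb, text_emb) +
            info_nce(flow_emb, text_emb) +
            info_nce(rgb_emb, flow_emb))


# Toy usage: random features standing in for encoder outputs.
N, D = 8, 512
rgb, flow, txt = (torch.randn(N, D) for _ in range(3))
print(trimodal_alignment_loss(rgb, flow, txt).item())
```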
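For the second contribution, the sketch below shows one way open-vocabulary classification via prompting could look: text embeddings of detailed action descriptions serve as classifiers, and class-agnostic proposals are scored by cosine similarity. It assumes the OpenAI `clip` package, descriptions generated offline by a language model, and proposal features living in CLIP's embedding space; it illustrates the idea only, not the paper's exact pipeline.

```python
# Sketch of open-vocabulary proposal classification with description prompts.
# The descriptions and the classify_proposals helper are hypothetical.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, _ = clip.load("ViT-B/32", device=device)

# Hypothetical LLM-generated descriptions that disambiguate the class names.
descriptions = {
    "clean and jerk": "a weightlifter lifts a barbell from the floor to overhead in two moves",
    "diving": "a person jumps from a springboard or platform and plunges into a pool",
}

with torch.no_grad():
    tokens = clip.tokenize(list(descriptions.values())).to(device)
    text_emb = model.encode_text(tokens).float()
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)   # (C, D) class embeddings


def classify_proposals(proposal_emb: torch.Tensor) -> torch.Tensor:
    """Cosine-similarity logits between proposal features (N, D) and class embeddings (C, D)."""
    proposal_emb = proposal_emb / proposal_emb.norm(dim=-1, keepdim=True)
    return 100.0 * proposal_emb @ text_emb.t()                  # (N, C) logits


# Toy usage: random features standing in for pooled proposal embeddings.
print(classify_proposals(torch.randn(4, text_emb.shape[-1], device=device)).softmax(-1))
```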