Existing temporal action detection (TAD) methods rely on large training sets with segment-level annotations and can recognize only previously seen classes at inference time. Collecting and annotating a large training set for each class of interest is costly and therefore does not scale. Zero-shot TAD (ZS-TAD) removes this obstacle by enabling a pre-trained model to recognize unseen action classes. ZS-TAD is, however, much more challenging and has received far less investigation. Inspired by the success of zero-shot image classification aided by vision-language (ViL) models such as CLIP, we aim to tackle the more complex TAD task. An intuitive approach is to combine an off-the-shelf proposal detector with CLIP-style classification. However, because localization (e.g., proposal generation) and classification are performed sequentially, this design is prone to localization error propagation. To overcome this problem, we propose a novel zero-Shot Temporal Action detection model via Vision-LanguagE prompting (STALE). This design effectively eliminates the dependence between localization and classification by breaking the route for error propagation between them. We further introduce an interaction mechanism between classification and localization for improved optimization. Extensive experiments on standard ZS-TAD video benchmarks show that STALE significantly outperforms state-of-the-art alternatives. Our model also yields superior results on supervised TAD compared with recent strong competitors. The PyTorch implementation of STALE is available at https://github.com/sauradip/STALE.
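To make the error-propagation argument concrete, the following is a minimal sketch of the two-stage baseline described above (proposals first, CLIP-style classification second). It is not the STALE model. The function names (`classify_proposals`, `mean_pool`, `cosine`), the temperature value, and the two-dimensional toy embeddings are all assumptions made for illustration; a real system would use embeddings produced by a vision-language model rather than hand-written vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mean_pool(frames):
    """Average a list of per-frame feature vectors into one segment vector."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def classify_proposals(frame_feats, proposals, class_embeds, temperature=0.07):
    """Label each (start, end) proposal by similarity to class text embeddings.

    Hypothetical helper: the proposals are assumed to come from a separate,
    off-the-shelf detector, so any boundary error is inherited here.
    """
    results = []
    for start, end in proposals:
        segment = mean_pool(frame_feats[start:end])
        logits = {name: cosine(segment, emb) / temperature
                  for name, emb in class_embeds.items()}
        # Numerically stable softmax over the class logits.
        peak = max(logits.values())
        exps = {name: math.exp(v - peak) for name, v in logits.items()}
        total = sum(exps.values())
        probs = {name: v / total for name, v in exps.items()}
        label = max(probs, key=probs.get)
        results.append((start, end, label, probs[label]))
    return results


# Toy stand-ins for text embeddings of two class prompts (assumed values).
class_embeds = {"run": [1.0, 0.0], "jump": [0.0, 1.0]}
# Eight frames: the first four resemble "run", the last four "jump".
frame_feats = [[0.9, 0.1]] * 4 + [[0.1, 0.9]] * 4

# Accurate proposals give one label per action.
good = classify_proposals(frame_feats, [(0, 4), (4, 8)], class_embeds)
# A mislocalized proposal pools mostly "jump" frames, so the classifier
# cannot recover the "run" segment: the localization error propagates.
shifted = classify_proposals(frame_feats, [(3, 8)], class_embeds)
```

In this toy setting the classifier is only as good as the boundaries it receives, which is the dependence that the abstract says STALE is designed to break.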