The growing number of action classes has posed a new challenge for video understanding, making Zero-Shot Action Recognition (ZSAR) a thriving direction. The ZSAR task aims to recognize target (unseen) actions without training examples by leveraging semantic representations to bridge seen and unseen actions. However, due to the complexity and diversity of actions, it remains challenging to semantically represent action classes and transfer knowledge from seen data. In this work, we propose an ER-enhanced ZSAR model inspired by an effective human memory technique, Elaborative Rehearsal (ER), which involves elaborating a new concept and relating it to known concepts. Specifically, we expand each action class into an Elaborative Description (ED) sentence, which is more discriminative than a class name and less costly than manually defined attributes. Besides directly aligning class semantics with videos, we incorporate objects from the video as Elaborative Concepts (EC) to improve video semantics and generalization from seen actions to unseen actions. Our ER-enhanced ZSAR model achieves state-of-the-art results on three existing benchmarks. Moreover, we propose a new ZSAR evaluation protocol on the Kinetics dataset to overcome limitations of current benchmarks, and demonstrate the first case where ZSAR performance is comparable to few-shot learning baselines in this more realistic setting. We will release our code and collected EDs at https://github.com/DeLightCMU/ElaborativeRehearsal.
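To make the inference idea concrete, the following is a minimal sketch of zero-shot classification by semantic alignment. It is not the released implementation. It assumes that each unseen class already has an embedding of its ED sentence, that the video has one embedding from its visual content and one from its detected objects (the EC stream), and that the two similarity scores are fused with a weight `alpha`. The function names, the toy 3-dimensional vectors, and the weighted-sum fusion are all hypothetical choices made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def zero_shot_predict(video_emb, object_emb, class_embs, alpha=0.5):
    """Score every candidate class and return the best label with all scores.

    Assumption: the score is a weighted sum of the video-to-class and
    object-to-class cosine similarities. `alpha` is a hypothetical fusion weight.
    """
    scores = {}
    for name, class_emb in class_embs.items():
        video_score = cosine(video_emb, class_emb)
        object_score = cosine(object_emb, class_emb)
        scores[name] = alpha * video_score + (1 - alpha) * object_score
    best = max(scores, key=scores.get)
    return best, scores


# Toy embeddings standing in for encoded ED sentences of two unseen classes.
class_embs = {
    "playing guitar": [1.0, 0.0, 0.0],
    "riding horse": [0.0, 1.0, 0.0],
}
video_emb = [0.9, 0.1, 0.0]   # hypothetical visual-stream embedding
object_emb = [0.8, 0.0, 0.2]  # hypothetical embedding of detected objects

label, scores = zero_shot_predict(video_emb, object_emb, class_embs)
print(label)
```

Because no unseen-class video is needed to build `class_embs`, new classes can be added at test time by encoding their descriptions only, which is the property that the zero-shot setting relies on.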