Skeleton-based action recognition has drawn much attention for its computational efficiency and robustness to lighting conditions. Existing skeleton-based action recognition methods are typically formulated as a one-hot classification task without fully utilizing the semantic relations between actions. For example, "make victory sign" and "thumb up" are two hand-gesture actions whose major difference lies in the movement of the hands. This information is absent from the categorical one-hot encoding of action classes but can be unveiled in the language description of actions. Therefore, utilizing action language descriptions in training could potentially benefit representation learning. In this work, we propose a Language Supervised Training (LST) approach for skeleton-based action recognition. More specifically, we employ a large-scale language model as the knowledge engine to provide text descriptions for body-part movements of actions, and propose a multi-modal training scheme that utilizes the text encoder to generate feature vectors for different body parts and supervise the skeleton encoder for action representation learning. Experiments show that our proposed LST method achieves noticeable improvements over various baseline models without extra computation cost at inference. LST achieves new state-of-the-art results on popular skeleton-based action recognition benchmarks, including NTU RGB+D, NTU RGB+D 120 and NW-UCLA. The code can be found at https://github.com/MartinXM/LST.
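The multi-modal training scheme described above, in which per-part text features supervise the skeleton encoder, can be sketched as a CLIP-style contrastive alignment. The following is a minimal illustration, not the paper's actual implementation: the shapes, the `temperature` value, and the per-part averaging are assumptions for clarity.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Normalize feature vectors to unit length along the last axis."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def part_alignment_loss(skeleton_feats, text_feats, labels, temperature=0.07):
    """Hypothetical per-part language supervision loss.

    skeleton_feats: (batch, parts, dim) features from the skeleton encoder,
                    one vector per body part.
    text_feats:     (classes, parts, dim) features from the text encoder,
                    one vector per class per body part description.
    labels:         (batch,) integer ground-truth class indices.

    For each body part, compute cosine-similarity logits between skeleton
    features and all class text features, apply cross-entropy against the
    ground-truth class, then average the loss over parts.
    """
    s = l2_normalize(skeleton_feats)
    t = l2_normalize(text_feats)
    part_losses = []
    for p in range(s.shape[1]):
        logits = s[:, p] @ t[:, p].T / temperature        # (batch, classes)
        logits = logits - logits.max(axis=1, keepdims=True)  # numeric stability
        log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
        nll = -log_probs[np.arange(len(labels)), labels].mean()
        part_losses.append(nll)
    return float(np.mean(part_losses))
```

At inference, only the skeleton encoder and its classification head are used, which is why the language supervision adds no extra computation cost at test time.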