Skeleton-based action recognition is widely used in areas such as surveillance and human-machine interaction. Existing models are mainly trained in a supervised manner and therefore depend heavily on large-scale labeled data, which may be infeasible to obtain when labeling is prohibitively expensive. In this paper, we propose a novel Contrast-Reconstruction Representation Learning network (CRRL) that simultaneously captures postures and motion dynamics for unsupervised skeleton-based action recognition. It consists of three main parts: the Sequence Reconstructor, the Contrastive Motion Learner, and the Information Fuser. The Sequence Reconstructor learns representations from skeleton coordinate sequences via reconstruction; as a result, the learned representations tend to focus on trivial postural coordinates and capture motion poorly. To enhance the learning of motion, the Contrastive Motion Learner performs contrastive learning between the representations learned from the coordinate sequence and from an additional velocity sequence. Finally, in the Information Fuser, we explore several strategies for combining the Sequence Reconstructor and the Contrastive Motion Learner, and propose to capture postures and motions simultaneously via a knowledge-distillation-based fusion strategy that transfers the motion learning from the Contrastive Motion Learner to the Sequence Reconstructor. Experimental results on several benchmarks, i.e., NTU RGB+D 60, NTU RGB+D 120, CMU mocap, and NW-UCLA, demonstrate the promise of the proposed CRRL method, which outperforms state-of-the-art approaches by a large margin.
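The contrastive idea described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not specify the loss or the encoders, so the InfoNCE objective, the frame-difference definition of velocity, the mean-squared-error distillation term, and all names (`velocity_sequence`, `info_nce`, `distillation_loss`, `toy_encoder`) are assumptions made for illustration only.

```python
import numpy as np


def velocity_sequence(coords):
    """Frame-to-frame differences of a (T, J, 3) coordinate sequence.

    Assumption: velocity is the first-order temporal difference,
    with the first frame set to zero so the shape is preserved.
    """
    vel = np.zeros_like(coords)
    vel[1:] = coords[1:] - coords[:-1]
    return vel


def l2_normalize(x, eps=1e-8):
    """Scale each row to (approximately) unit L2 norm."""
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def info_nce(z_coord, z_vel, temperature=0.1):
    """InfoNCE loss (an assumed choice of contrastive objective).

    Row i of z_coord and row i of z_vel come from the same sample
    and form the positive pair; all other rows act as negatives.
    """
    logits = l2_normalize(z_coord) @ l2_normalize(z_vel).T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


def distillation_loss(student_feat, teacher_feat):
    """Hypothetical feature-matching distillation term (mean squared error)."""
    return float(np.mean((student_feat - teacher_feat) ** 2))


def toy_encoder(seq, weight):
    """Hypothetical stand-in encoder: mean-pool over time, then a linear map."""
    batch = seq.shape[0]
    pooled = seq.reshape(batch, seq.shape[1], -1).mean(axis=1)  # (B, J*3)
    return pooled @ weight


rng = np.random.default_rng(0)
coords = rng.normal(size=(4, 16, 25, 3))  # batch, frames, joints, xyz
vel = np.stack([velocity_sequence(s) for s in coords])

w_coord = rng.normal(size=(75, 32))  # random weights, for illustration only
w_vel = rng.normal(size=(75, 32))

z_coord = toy_encoder(coords, w_coord)
z_vel = toy_encoder(vel, w_vel)

loss = info_nce(z_coord, z_vel)
```

In a full model the two encoders would be trained networks and the distillation term would be added to the reconstruction objective; here the random linear maps only make the data flow concrete.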