In this paper, we address self-supervised representation learning from human skeletons for action recognition. Previous methods, which usually learn feature representations from a single reconstruction task, are prone to overfitting, and the resulting features do not generalize well to action recognition. Instead, we propose to integrate multiple tasks to learn more general representations in a self-supervised manner. To this end, we combine motion prediction, jigsaw puzzle recognition, and contrastive learning, capturing skeleton features from different aspects. Motion prediction models skeleton dynamics by predicting the future sequence; solving jigsaw puzzles teaches temporal patterns, which are critical for action recognition; and contrastive learning further regularizes the feature space. In addition, we explore different training strategies for transferring the knowledge acquired from the self-supervised tasks to action recognition. We evaluate our multi-task self-supervised learning approach with action classifiers trained under different configurations, including unsupervised, semi-supervised, and fully-supervised settings. Our experiments on the NW-UCLA, NTU RGB+D, and PKUMMD datasets show remarkable action-recognition performance, demonstrating that our method learns more discriminative and general features.
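The three self-supervised tasks above can be combined into one weighted objective. The following is a minimal NumPy sketch of that idea, not the authors' implementation: the loss weights, tensor shapes, number of jigsaw permutations, and temperature are all illustrative placeholders standing in for encoder outputs.

```python
import numpy as np

rng = np.random.default_rng(0)

def mse_loss(pred, target):
    # Motion prediction: regress future skeleton frames.
    return float(np.mean((pred - target) ** 2))

def cross_entropy(logits, label):
    # Jigsaw puzzle recognition: classify which temporal
    # permutation was applied to the sequence.
    z = logits - logits.max()                     # numerical stability
    log_probs = z - np.log(np.exp(z).sum())
    return float(-log_probs[label])

def info_nce(anchor, positive, negatives, tau=0.1):
    # Contrastive regularization: pull two views of the same
    # sequence together, push other sequences apart.
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    logits = np.array([cos(anchor, positive)] +
                      [cos(anchor, n) for n in negatives]) / tau
    return cross_entropy(logits, 0)              # positive is index 0

# Toy tensors standing in for model outputs (shapes are assumptions):
pred_future = rng.normal(size=(10, 25, 3))       # T x joints x xyz
true_future = rng.normal(size=(10, 25, 3))
jigsaw_logits = rng.normal(size=6)               # e.g. 6 permutations
feat_a = rng.normal(size=128)                    # anchor embedding
feat_b = rng.normal(size=128)                    # positive view
negs = [rng.normal(size=128) for _ in range(8)]  # negative samples

# Weighted multi-task objective (weights are illustrative).
total = (1.0 * mse_loss(pred_future, true_future)
         + 1.0 * cross_entropy(jigsaw_logits, label=2)
         + 0.5 * info_nce(feat_a, feat_b, negs))
print(total)
```

In training, all three losses would share the same skeleton encoder, so gradients from each task shape a single feature space.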