Despite the success of fully supervised human skeleton sequence modeling, self-supervised pre-training for skeleton sequence representation learning has remained an active field because acquiring task-specific skeleton annotations at large scale is difficult. Recent studies focus on learning video-level temporal and discriminative information via contrastive learning, but overlook the hierarchical spatial-temporal nature of human skeletons. In contrast to such surface-level supervision at the video level, we propose a self-supervised hierarchical pre-training scheme incorporated into a hierarchical Transformer-based skeleton sequence encoder (Hi-TRS), which explicitly captures spatial, short-term temporal, and long-term temporal dependencies at the frame, clip, and video levels, respectively. To evaluate the proposed self-supervised pre-training scheme with Hi-TRS, we conduct extensive experiments covering three skeleton-based downstream tasks: action recognition, action detection, and motion prediction. Under both supervised and semi-supervised evaluation protocols, our method achieves state-of-the-art performance. Additionally, we demonstrate that the prior knowledge learned by our model during pre-training transfers well to different downstream tasks.
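The three-level hierarchy described above can be illustrated with a minimal sketch: attention over the joints of each frame, then over the frames inside each short clip, then over the clip embeddings of the whole video. This is only a toy illustration of the frame/clip/video factorization, not the actual Hi-TRS architecture — it uses unparameterized dot-product attention and mean pooling in place of learned Transformer blocks, and `clip_len` is an assumed hyperparameter.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X):
    """Plain (unparameterized) dot-product self-attention over the rows of X (n, d)."""
    d = X.shape[-1]
    scores = X @ X.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ X

def hierarchical_encode(seq, clip_len=4):
    """Toy three-level encoder.

    seq: (T, J, D) skeleton sequence with T frames, J joints, D-dim features.
    Frame level: spatial attention over the J joints of each frame.
    Clip level:  short-term temporal attention over frames in each clip_len window.
    Video level: long-term temporal attention over the clip embeddings.
    Returns one video-level embedding of shape (D,).
    """
    T, J, D = seq.shape
    # frame level: attend over joints, mean-pool to one vector per frame -> (T, D)
    frames = np.stack([self_attention(seq[t]).mean(axis=0) for t in range(T)])
    # clip level: attend within non-overlapping windows of clip_len frames -> (T // clip_len, D)
    clips = np.stack([
        self_attention(frames[s:s + clip_len]).mean(axis=0)
        for s in range(0, T, clip_len)
    ])
    # video level: attend over clips, pool to a single video-level vector -> (D,)
    return self_attention(clips).mean(axis=0)

emb = hierarchical_encode(np.random.randn(8, 25, 16))
print(emb.shape)  # (16,)
```

Each level only has to model dependencies over a short sequence (joints, a few frames, a few clips), which is what lets the hierarchy capture spatial, short-term, and long-term structure separately rather than attending over the full joint-time grid at once.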