Despite the success of fully supervised human skeleton sequence modeling, self-supervised pre-training for skeleton sequence representation learning remains an active field because task-specific skeleton annotations are difficult to acquire at scale. Recent studies focus on learning video-level temporal and discriminative information through contrastive learning, but overlook the hierarchical spatial-temporal nature of human skeletons. Unlike such superficial supervision at the video level, we propose a self-supervised hierarchical pre-training scheme incorporated into a hierarchical Transformer-based skeleton sequence encoder (Hi-TRS) that explicitly captures spatial, short-term temporal, and long-term temporal dependencies at the frame, clip, and video levels, respectively. To evaluate the proposed pre-training scheme with Hi-TRS, we conduct extensive experiments on three skeleton-based downstream tasks: action recognition, action detection, and motion prediction. Under both supervised and semi-supervised evaluation protocols, our method achieves state-of-the-art performance. Additionally, we demonstrate that the prior knowledge learned by our model during pre-training transfers well across different downstream tasks.
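The frame, clip, and video hierarchy described above can be illustrated with a minimal shape-level sketch. This is not the Hi-TRS implementation: the function names, the clip length, the stride, and the use of overlapping clips are all assumptions made for illustration, and plain mean pooling stands in for the Transformer attention at each level. It only shows how a skeleton sequence of shape (frames, joints, channels) could be organized into the three levels.

```python
import numpy as np


def split_into_clips(seq, clip_len, stride):
    """Slice a (T, J, C) skeleton sequence into clips of shape (clip_len, J, C).

    clip_len and stride are hypothetical parameters, not values from the paper.
    """
    num_frames = seq.shape[0]
    starts = range(0, num_frames - clip_len + 1, stride)
    return np.stack([seq[s:s + clip_len] for s in starts])  # (N, L, J, C)


def hierarchical_encode(seq, clip_len=4, stride=2):
    """Aggregate a skeleton sequence at frame, clip, and video levels.

    Mean pooling is a placeholder for the attention layers a real
    Transformer-based encoder would use at each level.
    """
    clips = split_into_clips(seq, clip_len, stride)  # (N, L, J, C)
    # Frame level: aggregate over joints within each frame (spatial).
    frame_feats = clips.mean(axis=2)                 # (N, L, C)
    # Clip level: aggregate over frames within each clip (short-term temporal).
    clip_feats = frame_feats.mean(axis=1)            # (N, C)
    # Video level: aggregate over all clips (long-term temporal).
    video_feat = clip_feats.mean(axis=0)             # (C,)
    return frame_feats, clip_feats, video_feat


if __name__ == "__main__":
    # Toy input: 16 frames, 25 joints, 3D coordinates (sizes chosen arbitrarily).
    rng = np.random.default_rng(0)
    sequence = rng.normal(size=(16, 25, 3))
    frames, clips, video = hierarchical_encode(sequence)
    print(frames.shape, clips.shape, video.shape)
```

In a hierarchical pre-training setup of the kind the abstract describes, each of the three outputs would be the natural place to attach a level-specific self-supervised objective; the abstract does not specify those objectives, so none are sketched here.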