Self-supervised Action Representation Learning from Partial Spatio-Temporal Skeleton Sequences

Self-supervised learning has demonstrated remarkable capability in representation learning for skeleton-based action recognition. Existing methods mainly apply global data augmentation to generate different views of a skeleton sequence for contrastive learning. However, skeleton sequences contain rich action clues, and methods that take only a global perspective learn to discriminate between skeletons without fully exploiting the local relationships between skeleton joints and between video frames, which are essential for real-world applications. In this work, we propose a Partial Spatio-Temporal Learning (PSTL) framework that exploits these local relationships using partial skeleton sequences built by a unique spatio-temporal masking strategy. Specifically, we construct a negative-sample-free triplet stream structure composed of an anchor stream without any masking, a spatial masking stream with Central Spatial Masking (CSM), and a temporal masking stream with Motion Attention Temporal Masking (MATM). The feature cross-correlation matrix is measured between the anchor stream and each of the two masking streams. (1) Central Spatial Masking discards selected joints from the feature calculation process, where joints with a higher degree centrality have a higher probability of being selected. (2) Motion Attention Temporal Masking leverages the motion of the action and removes frames, where frames with faster motion have a higher probability of being removed. Our method achieves state-of-the-art performance on NTU RGB+D 60, NTU RGB+D 120 and PKU-MMD under various downstream tasks. Furthermore, we perform a practical evaluation in which some skeleton joints are lost in downstream tasks. In contrast to previous methods, which suffer large performance drops, PSTL still achieves remarkable results under this challenging setting, validating the robustness of our method.
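The two masking strategies and the cross-correlation objective described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the softmax over motion magnitudes, the Barlow-Twins-style form of the loss, the weight `lam`, and the toy five-joint skeleton are all assumptions made for illustration. The abstract states only that higher-centrality joints and faster-moving frames are more likely to be masked, and that a feature cross-correlation matrix is measured between the anchor stream and each masked stream.

```python
import numpy as np


def joint_mask_probs(adjacency):
    """Sampling probability per joint, proportional to its degree centrality."""
    degree = adjacency.sum(axis=1).astype(float)
    return degree / degree.sum()


def frame_mask_probs(sequence):
    """Sampling probability per frame, higher for frames with larger motion.

    sequence has shape (T, V, C): frames, joints, coordinates.
    The softmax normalization is an assumption of this sketch.
    """
    # Frame-to-frame displacement; the first frame gets zero motion.
    motion = np.diff(sequence, axis=0, prepend=sequence[:1])
    magnitude = np.abs(motion).mean(axis=(1, 2))  # shape (T,)
    logits = magnitude - magnitude.max()          # for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()


def sample_mask(probs, num_masked, rng):
    """Draw distinct indices to mask according to the given probabilities."""
    return rng.choice(len(probs), size=num_masked, replace=False, p=probs)


def cross_correlation_loss(z_a, z_b, lam=5e-3, eps=1e-6):
    """Barlow-Twins-style loss on the cross-correlation matrix (assumed form).

    z_a and z_b have shape (N, D): a batch of embeddings from two streams.
    """
    z_a = (z_a - z_a.mean(axis=0)) / (z_a.std(axis=0) + eps)
    z_b = (z_b - z_b.mean(axis=0)) / (z_b.std(axis=0) + eps)
    c = z_a.T @ z_b / z_a.shape[0]                # shape (D, D)
    diag = np.diag(c)
    on_diag = ((diag - 1.0) ** 2).sum()           # pull diagonal toward 1
    off_diag = (c ** 2).sum() - (diag ** 2).sum() # push off-diagonal toward 0
    return on_diag + lam * off_diag


def toy_adjacency():
    """Hypothetical 5-joint skeleton: joint 0 is a hub, joint 3 links to 4."""
    adjacency = np.zeros((5, 5), dtype=int)
    for i, j in [(0, 1), (0, 2), (0, 3), (3, 4)]:
        adjacency[i, j] = adjacency[j, i] = 1
    return adjacency


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sequence = rng.normal(size=(16, 5, 3))        # random toy sequence
    masked_joints = sample_mask(joint_mask_probs(toy_adjacency()), 2, rng)
    masked_frames = sample_mask(frame_mask_probs(sequence), 4, rng)
    print("masked joints:", masked_joints)
    print("masked frames:", masked_frames)
```

In a triplet setup of the kind the abstract describes, such a loss would be evaluated twice, once between the anchor embedding and the spatially masked embedding and once between the anchor embedding and the temporally masked embedding. How the two terms are combined is not stated in the abstract.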
