This paper addresses self-supervised learning of a feature space suitable for skeleton-based action recognition. Our proposal builds on learning invariances to input skeleton representations and to various skeleton augmentations via noise contrastive estimation. In particular, we propose inter-skeleton contrastive learning, which learns from multiple different input skeleton representations in a cross-contrastive manner. In addition, we contribute several skeleton-specific spatial and temporal augmentations, which further encourage the model to learn the spatio-temporal dynamics of skeleton data. By learning similarities between different skeleton representations as well as between augmented views of the same sequence, the network is encouraged to learn higher-level semantics of the skeleton data than when using the augmented views alone. Our approach achieves state-of-the-art performance for self-supervised learning from skeleton data on the challenging PKU and NTU datasets across multiple downstream tasks, including action recognition, action retrieval and semi-supervised learning. Code is available at https://github.com/fmthoker/skeleton-contrast.
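To make the cross-contrastive objective concrete, the following is a minimal sketch of a noise contrastive (InfoNCE-style) loss applied across two skeleton representations. It is an illustration under stated assumptions, not the released implementation: the function names (`info_nce`, `inter_skeleton_loss`), the temperature value, and the use of in-batch negatives are all hypothetical choices made here, and `z_a` and `z_b` stand for embeddings that two separate encoders would produce from two different input representations of the same batch of sequences.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def info_nce(queries, keys, temperature=0.07):
    """InfoNCE loss with in-batch negatives (an assumption of this sketch).

    queries[i] and keys[i] form the positive pair; every other key in the
    batch acts as a negative for queries[i].
    """
    q = l2_normalize(queries)
    k = l2_normalize(keys)
    logits = q @ k.T / temperature                       # shape (N, N)
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))


def inter_skeleton_loss(z_a, z_b, temperature=0.07):
    """Cross-contrastive loss between two skeleton representations.

    z_a, z_b: (N, D) embeddings of the same N sequences, each computed from a
    different input representation (hypothetical encoders, not shown here).
    Queries from one representation are matched against keys from the other,
    in both directions.
    """
    return 0.5 * (info_nce(z_a, z_b, temperature) + info_nce(z_b, z_a, temperature))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    z_a = rng.normal(size=(8, 16))
    z_b = z_a + 0.01 * rng.normal(size=z_a.shape)  # nearly aligned embeddings
    print("aligned pairs:   ", inter_skeleton_loss(z_a, z_b))
    print("mismatched pairs:", inter_skeleton_loss(z_a, z_b[::-1]))
```

In this sketch the loss is low when the two representations of the same sequence embed close together and high when the pairing is broken, which is the behaviour the cross-contrastive objective is meant to reward. The skeleton-specific augmentations would be applied to the inputs before encoding and are not modelled here.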