Human skeleton point clouds are commonly used to automatically classify and predict the behaviour of others. In this paper, we use a contrastive self-supervised learning method, SimCLR, to learn representations that capture the semantics of skeleton point clouds. This work focuses on systematically evaluating the effects that different algorithmic decisions (including augmentations, dataset partitioning, and backbone architecture) have on the learned skeleton representations. To pre-train the representations, we normalise six existing datasets to obtain more than 40 million skeleton frames. We evaluate the quality of the learned representations on three downstream tasks: skeleton reconstruction, motion prediction, and activity classification. Our results demonstrate the importance of 1) combining spatial and temporal augmentations, 2) including additional datasets for encoder training, and 3) using a graph neural network as an encoder.
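For readers unfamiliar with SimCLR, the contrastive objective it is built on (the NT-Xent loss) can be sketched as follows. This is a minimal illustration in plain Python, not the implementation used in this work: the function names, the temperature value of 0.5, and the toy two-dimensional embeddings are all assumptions made for the example. Each skeleton is assumed to have been augmented twice and encoded, so that `view_a[i]` and `view_b[i]` form a positive pair and every other embedding in the batch acts as a negative.

```python
import math


def _normalise(v):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def nt_xent_loss(view_a, view_b, temperature=0.5):
    """NT-Xent loss over N positive pairs (view_a[i], view_b[i]).

    `temperature` is an illustrative default, not a value from the paper.
    """
    if len(view_a) != len(view_b) or not view_a:
        raise ValueError("both views must contain the same, non-zero number of embeddings")

    n = len(view_a)
    z = [_normalise(v) for v in list(view_a) + list(view_b)]

    total = 0.0
    for i in range(2 * n):
        positive = (i + n) % (2 * n)  # index of the other view of the same sample
        # Similarities to every other embedding in the batch (self excluded).
        logits = [_dot(z[i], z[k]) / temperature for k in range(2 * n) if k != i]
        positive_logit = _dot(z[i], z[positive]) / temperature
        # Numerically stable log-sum-exp for the softmax denominator.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_denominator - positive_logit
    return total / (2 * n)


if __name__ == "__main__":
    # Toy embeddings standing in for encoder outputs of two augmented views.
    view_a = [[1.0, 0.0], [0.0, 1.0]]
    view_b = [[1.0, 0.1], [0.1, 1.0]]
    print("aligned pairs:   ", nt_xent_loss(view_a, view_b))
    print("mismatched pairs:", nt_xent_loss(view_a, view_b[::-1]))
```

The loss is lower when the two views of the same skeleton map to nearby embeddings, which is why the choice of augmentations matters: they define what the encoder is trained to treat as the same sample.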