Current self-supervised approaches for skeleton action representation learning often focus on constrained scenarios, where videos and skeleton data are recorded in laboratory settings. When dealing with estimated skeleton data in real-world videos, such methods perform poorly due to the large variations across subjects and camera viewpoints. To address this issue, we introduce ViA, a novel View-Invariant Autoencoder for self-supervised skeleton action representation learning. ViA leverages motion retargeting between different human performers as a pretext task, in order to disentangle the latent action-specific `Motion' features on top of the visual representation of a 2D or 3D skeleton sequence. Such `Motion' features are invariant to skeleton geometry and camera view and allow ViA to facilitate both cross-subject and cross-view action classification tasks. We conduct a study focusing on transfer learning for skeleton-based action recognition with self-supervised pre-training on real-world data (e.g., Posetics). Our results show that the skeleton representations learned by ViA are generic enough to improve upon state-of-the-art action classification accuracy, not only on 3D laboratory datasets such as NTU-RGB+D 60 and NTU-RGB+D 120, but also on real-world datasets where only 2D data can be accurately estimated, e.g., Toyota Smarthome, UAV-Human and Penn Action.
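To make the motion-retargeting pretext task concrete, the sketch below illustrates one possible encoder-decoder formulation in PyTorch: a sequence encoder that factorizes a skeleton sequence into a time-varying `Motion' code and a static skeleton/view code, and a decoder that reconstructs a sequence from any pairing of the two. All module names, dimensions, and the loss structure are illustrative assumptions for exposition, not the authors' released implementation.

```python
# Minimal, assumption-based sketch of a motion-retargeting autoencoder.
# Names (SkeletonEncoder, SkeletonDecoder, retargeting_loss) are hypothetical.
import torch
import torch.nn as nn


class SkeletonEncoder(nn.Module):
    """Encodes a skeleton sequence (B, T, J*C) into a time-varying 'motion'
    code and a static 'character' (skeleton geometry / view) code."""

    def __init__(self, in_dim, motion_dim=128, char_dim=64, hidden=256):
        super().__init__()
        self.gru = nn.GRU(in_dim, hidden, batch_first=True)
        self.to_motion = nn.Linear(hidden, motion_dim)   # per-frame motion code
        self.to_char = nn.Linear(hidden, char_dim)       # time-pooled static code

    def forward(self, x):                   # x: (B, T, J*C)
        h, _ = self.gru(x)                  # (B, T, hidden)
        motion = self.to_motion(h)          # (B, T, motion_dim)
        char = self.to_char(h.mean(dim=1))  # (B, char_dim)
        return motion, char


class SkeletonDecoder(nn.Module):
    """Reconstructs a skeleton sequence from a motion code and a character code."""

    def __init__(self, out_dim, motion_dim=128, char_dim=64, hidden=256):
        super().__init__()
        self.gru = nn.GRU(motion_dim + char_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, out_dim)

    def forward(self, motion, char):
        # Broadcast the static code over time and decode jointly with motion.
        char_seq = char.unsqueeze(1).expand(-1, motion.size(1), -1)
        h, _ = self.gru(torch.cat([motion, char_seq], dim=-1))
        return self.out(h)


def retargeting_step(enc, dec, seq_a, seq_b):
    """Pretext objective sketch: each sequence should be reconstructed from its
    own codes, and swapping the static codes between two performers should
    still yield skeletons driven by the other sequence's motion. How the
    swapped output is supervised (paired data, cycle consistency, ...) is
    elided here."""
    m_a, c_a = enc(seq_a)
    m_b, c_b = enc(seq_b)
    recon = nn.functional.mse_loss(dec(m_a, c_a), seq_a) \
          + nn.functional.mse_loss(dec(m_b, c_b), seq_b)
    retargeted = dec(m_a, c_b)  # motion of A rendered with skeleton/view of B
    return recon, retargeted
```

Under this factorization, the `Motion' code is the part of the representation that survives the swap, which is what makes it a candidate view- and subject-invariant feature for downstream action classification.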