The human skeleton, as a compact representation of human action, has received increasing attention in recent years. Many skeleton-based action recognition methods adopt graph convolutional networks (GCNs) to extract features from human skeletons. Despite the positive results reported in previous work, GCN-based methods are limited in robustness, interoperability, and scalability. In this work, we propose PoseC3D, a new approach to skeleton-based action recognition that relies on a 3D heatmap stack, rather than a graph sequence, as the base representation of human skeletons. Compared to GCN-based methods, PoseC3D is more effective in learning spatiotemporal features, more robust against pose estimation noise, and generalizes better in cross-dataset settings. In addition, PoseC3D can handle multi-person scenarios without additional computation cost, and its features can be easily integrated with other modalities at early fusion stages, which opens a large design space for further boosting performance. On four challenging datasets, PoseC3D consistently obtains superior performance, both when used alone on skeletons and in combination with the RGB modality.
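To make the idea of a 3D heatmap stack concrete, the following is a minimal sketch under stated assumptions, not the implementation used in this work. It assumes that 2D keypoints with confidence scores are already available from some pose estimator, renders one Gaussian map per joint and frame, and stacks the maps along time into a volume of shape (K, T, H, W). The function name `keypoints_to_heatmap_volume`, the Gaussian form, the value of `sigma`, and the use of an element-wise maximum to merge persons are all illustrative choices.

```python
import numpy as np


def keypoints_to_heatmap_volume(keypoints, scores, height, width, sigma=1.0):
    """Render 2D keypoints as a stack of per-joint heatmaps (hypothetical helper).

    keypoints: array of shape (M, T, K, 2) holding (x, y) pixel coordinates
               for M persons, T frames, and K joints.
    scores:    array of shape (M, T, K) holding keypoint confidences.
    Returns a float32 volume of shape (K, T, height, width).
    """
    keypoints = np.asarray(keypoints, dtype=np.float32)
    scores = np.asarray(scores, dtype=np.float32)
    num_persons, num_frames, num_joints, _ = keypoints.shape

    ys = np.arange(height, dtype=np.float32)[:, None]  # column of row indices, (H, 1)
    xs = np.arange(width, dtype=np.float32)[None, :]   # row of column indices, (1, W)

    volume = np.zeros((num_joints, num_frames, height, width), dtype=np.float32)
    for m in range(num_persons):
        for t in range(num_frames):
            for k in range(num_joints):
                x, y = keypoints[m, t, k]
                # Assumed form: a Gaussian centered on the joint, scaled by its confidence.
                gauss = np.exp(-((xs - x) ** 2 + (ys - y) ** 2) / (2 * sigma ** 2))
                # Assumed merge rule: all persons share the same K channels,
                # so the volume shape does not depend on M.
                np.maximum(volume[k, t], gauss * scores[m, t, k], out=volume[k, t])
    return volume


# Two persons, one frame, one joint each, on a 16x16 grid.
kps = np.array([[[[8.0, 4.0]]], [[[2.0, 10.0]]]])
conf = np.ones((2, 1, 1))
vol = keypoints_to_heatmap_volume(kps, conf, height=16, width=16)
print(vol.shape)
```

Because every person is written into the same K channels, the volume that a 3D convolutional network would consume has the same size for one person or several, which is one way to read the claim about multi-person scenarios.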