The ability to identify and temporally segment fine-grained actions in motion capture sequences is crucial for applications in human movement analysis. Motion capture is typically performed with optical or inertial measurement systems, which encode movement as a time series of joint locations and orientations or their higher-order representations. State-of-the-art action segmentation approaches use multiple stages of temporal convolutions: an initial prediction is generated with several layers of temporal convolutions and then refined over multiple stages, again with temporal convolutions. Although these approaches capture long-term temporal patterns, the initial predictions do not adequately consider the spatial hierarchy among the human joints. To address this limitation, we present multi-stage spatial-temporal graph convolutional neural networks (MS-GCN). Our framework decouples the architecture of the initial prediction generation stage from that of the refinement stages. Specifically, we replace the initial stage of temporal convolutions with spatial-temporal graph convolutions, which better exploit the spatial configuration of the joints and their temporal dynamics. We compared our framework to four strong baselines on five tasks. Experimental results demonstrate that it achieves state-of-the-art performance.
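To make the multi-stage design concrete, the following is a minimal PyTorch sketch of the architecture described above: a first stage of spatial-temporal graph convolutions generates initial frame-wise predictions, and subsequent stages of dilated temporal convolutions refine them. The module names, layer counts, kernel sizes, and placeholder adjacency matrix are illustrative assumptions, not the authors' reference implementation.

```python
# Hypothetical sketch of the MS-GCN idea; names and hyperparameters are assumptions.
import torch
import torch.nn as nn

class STGCNBlock(nn.Module):
    """Spatial graph convolution over joints followed by a temporal convolution."""
    def __init__(self, in_ch, out_ch, A):
        super().__init__()
        # A: (V, V) normalized adjacency encoding the skeleton graph (assumed given).
        self.register_buffer("A", A)
        self.spatial = nn.Conv2d(in_ch, out_ch, kernel_size=1)  # per-joint feature mixing
        self.temporal = nn.Conv2d(out_ch, out_ch, kernel_size=(9, 1), padding=(4, 0))
        self.relu = nn.ReLU()

    def forward(self, x):  # x: (N, C, T, V)
        x = self.spatial(x)
        x = torch.einsum("nctv,vw->nctw", x, self.A)  # aggregate over neighboring joints
        return self.relu(self.temporal(x))

class RefinementStage(nn.Module):
    """Dilated temporal convolutions that refine frame-wise class predictions."""
    def __init__(self, num_classes, num_f=64, num_layers=10):
        super().__init__()
        self.in_proj = nn.Conv1d(num_classes, num_f, 1)
        self.layers = nn.ModuleList(
            nn.Conv1d(num_f, num_f, 3, padding=2 ** i, dilation=2 ** i)
            for i in range(num_layers)
        )
        self.out_proj = nn.Conv1d(num_f, num_classes, 1)

    def forward(self, p):  # p: (N, num_classes, T)
        h = self.in_proj(p)
        for conv in self.layers:
            h = h + torch.relu(conv(h))  # residual dilated temporal conv
        return self.out_proj(h)

class MSGCN(nn.Module):
    """Stage 1: ST-GCN prediction generation; stages 2..S: temporal refinement."""
    def __init__(self, in_ch, num_classes, A, num_stages=4):
        super().__init__()
        self.stage1 = nn.Sequential(STGCNBlock(in_ch, 64, A), STGCNBlock(64, 64, A))
        self.cls = nn.Conv2d(64, num_classes, kernel_size=1)
        self.refine = nn.ModuleList(
            RefinementStage(num_classes) for _ in range(num_stages - 1)
        )

    def forward(self, x):  # x: (N, C, T, V) joint features over time
        h = self.stage1(x)
        logits = self.cls(h).mean(dim=-1)  # pool over joints -> (N, classes, T)
        outputs = [logits]
        for stage in self.refine:
            logits = stage(torch.softmax(logits, dim=1))  # refine previous prediction
            outputs.append(logits)
        return outputs  # per-stage frame-wise logits; a loss is typically applied to all

# Illustrative usage with a toy skeleton of 25 joints and 3D positions:
V = 25
A = torch.eye(V)               # placeholder adjacency; use the skeleton graph in practice
model = MSGCN(in_ch=3, num_classes=10, A=A)
x = torch.randn(2, 3, 200, V)  # batch of 2 sequences, 200 frames
preds = model(x)               # list of (2, 10, 200) logits, one per stage
```

Under these assumptions, the key decoupling is visible in MSGCN.forward: only the first stage sees the joint dimension V, while the refinement stages operate purely on the temporal sequence of class predictions.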