Recently, there has been growing interest in human motion prediction, which involves forecasting future body poses from an observed pose sequence. The task is challenging because it requires modeling both spatial and temporal relationships. The most commonly used models for this task are autoregressive models, such as recurrent neural networks (RNNs) and their variants, and Transformer networks. However, RNNs suffer from several drawbacks, such as vanishing or exploding gradients. Other researchers have attempted to solve the communication problem in the spatial dimension, i.e., how information is exchanged between joints, by integrating Graph Convolutional Networks (GCNs) with Long Short-Term Memory (LSTM) models. These works, however, process temporal and spatial information separately, which limits their effectiveness. To address this problem, we propose a novel approach called the multi-graph convolution network (MGCN) for 3D human pose forecasting. This model captures spatial and temporal information simultaneously by introducing an augmented graph for pose sequences: the poses from multiple frames form multiple parts, which are joined together in a single graph instance. Furthermore, we explore the influence of the natural structure and of sequence-aware attention on our model. In our experimental evaluation on the large-scale benchmark datasets Human3.6M, AMASS and 3DPW, MGCN outperforms state-of-the-art methods in pose prediction.
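The abstract says only that the poses from several frames are joined into one augmented graph, so spatial and temporal information is captured together. The following minimal pure-Python sketch shows one way such a graph could be built and used. It is not MGCN's actual construction. It rests on three assumptions that the abstract does not state: spatial edges are the skeleton bones repeated in every frame, temporal edges link the same joint in consecutive frames, and the adjacency is normalized as in a standard GCN. The helper names `augmented_adjacency`, `normalize`, `matmul` and `graph_conv` are hypothetical. The sketch also omits the multiple graphs and the sequence-aware attention.

```python
import math
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def augmented_adjacency(num_joints: int, num_frames: int,
                        bones: Sequence[Tuple[int, int]]) -> Matrix:
    """Hypothetical adjacency over every (frame, joint) node of a pose sequence.

    Node index = frame * num_joints + joint. Assumed edge types:
      * spatial: each skeleton bone (i, j), repeated inside every frame
      * temporal: the same joint in two consecutive frames
      * a self-loop on every node
    """
    n = num_joints * num_frames
    adj = [[0.0] * n for _ in range(n)]

    def link(a: int, b: int) -> None:
        # Undirected edge: keep the matrix symmetric.
        adj[a][b] = 1.0
        adj[b][a] = 1.0

    for t in range(num_frames):
        base = t * num_joints
        for i, j in bones:
            link(base + i, base + j)              # spatial edge in frame t
        if t + 1 < num_frames:
            for j in range(num_joints):
                link(base + j, base + num_joints + j)  # joint j: frame t -> t+1

    for v in range(n):
        adj[v][v] = 1.0                           # self-loops
    return adj


def normalize(adj: Matrix) -> Matrix:
    """Symmetric normalization D^{-1/2} A D^{-1/2}; self-loops keep every degree >= 1."""
    deg = [sum(row) for row in adj]
    n = len(adj)
    return [[adj[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product a @ b."""
    return [[sum(a[i][k] * b[k][c] for k in range(len(b))) for c in range(len(b[0]))]
            for i in range(len(a))]


def graph_conv(a_hat: Matrix, x: Matrix, w: Matrix) -> Matrix:
    """One linear graph-convolution step H = A_hat @ X @ W (no activation).

    x holds one feature row per (frame, joint) node, so a single step mixes
    spatial neighbours (same frame) and temporal neighbours (adjacent frames).
    """
    return matmul(a_hat, matmul(x, w))


if __name__ == "__main__":
    # Toy skeleton: a 3-joint chain 0-1-2 observed over 2 frames -> 6 nodes.
    adj = augmented_adjacency(num_joints=3, num_frames=2, bones=[(0, 1), (1, 2)])
    a_hat = normalize(adj)
    x = [[1.0] for _ in range(6)]   # one scalar feature per node
    w = [[1.0, 2.0]]                # maps 1 input feature to 2 outputs
    h = graph_conv(a_hat, x, w)
    print(len(h), len(h[0]))        # 6 2
```

Each node stands for one joint in one frame. A single propagation step therefore already combines a joint's skeletal neighbours in its own frame with the same joint in adjacent frames. Stacking steps widens this spatio-temporal receptive field.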