Modern multi-layer perceptron (MLP) models have shown competitive results in learning visual representations without self-attention. However, existing MLP models struggle to capture local details and lack prior knowledge of human body configurations, which limits their modeling power for skeletal representation learning. To address these issues, we propose a simple yet effective graph-reinforced MLP-like architecture, named GraphMLP, that combines MLPs and graph convolutional networks (GCNs) in a global-local-graphical unified architecture for 3D human pose estimation. GraphMLP incorporates the graph structure of human bodies into an MLP model to meet the domain-specific demands of 3D human pose estimation, while allowing for both local and global spatial interactions. Furthermore, we flexibly and efficiently extend GraphMLP to the video domain and show that complex temporal dynamics can be effectively modeled in a simple way, with negligible growth in computational cost as the sequence length increases. To the best of our knowledge, this is the first MLP-like architecture for 3D human pose estimation from both a single frame and a video sequence. Extensive experiments show that the proposed GraphMLP achieves state-of-the-art performance on two datasets, i.e., Human3.6M and MPI-INF-3DHP. Code and models are available at https://github.com/Vegetebird/GraphMLP.
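The abstract does not specify implementation details, but the core idea of combining a joint-mixing MLP (global interactions) with a graph convolution over the skeleton (local, body-aware interactions) can be illustrated with a minimal PyTorch sketch. Everything below is an assumption for illustration: the names GraphConv and GraphMLPBlock, the residual layout, and the identity adjacency stand-in are hypothetical and do not reproduce the authors' actual architecture.

```python
import torch
import torch.nn as nn

class GraphConv(nn.Module):
    """Basic graph convolution over a fixed skeleton adjacency (illustrative simplification)."""
    def __init__(self, dim, adj):
        super().__init__()
        self.register_buffer("adj", adj)        # (J, J) normalized skeleton adjacency
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):                        # x: (B, J, C)
        # Aggregate projected features from graph-connected joints (local structure prior)
        return self.adj @ self.proj(x)

class GraphMLPBlock(nn.Module):
    """Hypothetical block: spatial MLP for global joint mixing + GCN for skeletal structure."""
    def __init__(self, num_joints, dim, adj, hidden=256):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Token (spatial) MLP mixes information across all joints -> global interactions
        self.token_mlp = nn.Sequential(
            nn.Linear(num_joints, hidden), nn.GELU(), nn.Linear(hidden, num_joints)
        )
        self.norm2 = nn.LayerNorm(dim)
        # Graph convolution injects skeleton connectivity -> local, body-aware interactions
        self.gcn = GraphConv(dim, adj)

    def forward(self, x):                        # x: (B, J, C)
        x = x + self.token_mlp(self.norm1(x).transpose(1, 2)).transpose(1, 2)
        x = x + self.gcn(self.norm2(x))
        return x

# Usage sketch: 17 Human3.6M joints; identity matrix stands in for the real skeleton adjacency
J, C = 17, 64
block = GraphMLPBlock(J, C, torch.eye(J))
out = block(torch.randn(2, J, C))                # -> (2, 17, 64)
```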