Modern multi-layer perceptron (MLP) models have shown competitive results in learning visual representations without self-attention. However, existing MLP models struggle to capture local details and lack prior knowledge of human body configurations, which limits their modeling power for skeletal representation learning. To address these issues, we propose a simple yet effective graph-reinforced MLP-like architecture, named GraphMLP, that combines MLPs and graph convolutional networks (GCNs) in a global-local-graphical unified architecture for 3D human pose estimation. GraphMLP incorporates the graph structure of human bodies into an MLP model to meet the domain-specific requirements of 3D human pose estimation, while allowing for both local and global spatial interactions. Furthermore, we propose to flexibly and efficiently extend GraphMLP to the video domain and show that complex temporal dynamics can be effectively modeled in a simple way, with negligible additional computational cost as the sequence length grows. To the best of our knowledge, this is the first MLP-like architecture for 3D human pose estimation from both a single frame and a video sequence. Extensive experiments show that the proposed GraphMLP achieves state-of-the-art performance on two datasets, i.e., Human3.6M and MPI-INF-3DHP. The source code will be made publicly available.
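For illustration, the sketch below shows one plausible way a global-local-graphical block of this kind could be assembled in PyTorch: a token-mixing MLP over the joints provides global spatial interactions, while a graph convolution over the skeletal adjacency injects the body-structure prior for local detail. The module names, layer sizes, and the exact fusion of the MLP and GCN branches are assumptions for exposition, not the authors' released implementation.

```python
# Conceptual sketch only (assumed design, not the official GraphMLP code).
import torch
import torch.nn as nn


class GraphConv(nn.Module):
    """First-order graph convolution: X' = norm(A) X W, with A including self-loops."""

    def __init__(self, dim, adj):
        super().__init__()
        # Row-normalize the skeletal adjacency so each joint averages over its neighbors.
        self.register_buffer("adj", adj / adj.sum(dim=-1, keepdim=True))
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):  # x: (batch, joints, dim)
        return self.proj(torch.einsum("jk,bkd->bjd", self.adj, x))


class GraphMLPBlock(nn.Module):
    """One block mixing global (token MLP) and local graph-structured (GCN) information."""

    def __init__(self, num_joints, dim, adj, hidden=256):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        # Token-mixing MLP across joints captures global dependencies.
        self.token_mlp = nn.Sequential(
            nn.Linear(num_joints, num_joints), nn.GELU(), nn.Linear(num_joints, num_joints)
        )
        # GCN branch encodes the human-body graph prior for local interactions.
        self.gcn = GraphConv(dim, adj)
        self.norm2 = nn.LayerNorm(dim)
        # Channel MLP refines per-joint features.
        self.channel_mlp = nn.Sequential(
            nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)
        )

    def forward(self, x):  # x: (batch, joints, dim)
        y = self.norm1(x)
        # Fuse the global (token-mixing) and local (graph) branches by summation (assumed).
        y = self.token_mlp(y.transpose(1, 2)).transpose(1, 2) + self.gcn(y)
        x = x + y
        return x + self.channel_mlp(self.norm2(x))


if __name__ == "__main__":
    joints, dim = 17, 128  # e.g., the 17-joint Human3.6M skeleton
    adj = torch.eye(joints)  # placeholder adjacency; a real one encodes bone connections
    block = GraphMLPBlock(joints, dim, adj)
    out = block(torch.randn(2, joints, dim))
    print(out.shape)  # torch.Size([2, 17, 128])
```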