Graph convolutional networks (GCNs) have been applied to model the physically connected and non-local relations among human joints for 3D human pose estimation (HPE). In addition, purely Transformer-based models have recently shown promising results in video-based 3D HPE. However, single-frame methods still need to model the physically connected relations among joints, because feature representations transformed only by global relations via the Transformer neglect information about the human skeleton. To address this problem, we propose AMPose, a novel method in which Transformer encoder and GCN blocks are alternately stacked to combine the global and physically connected relations among joints for HPE. In AMPose, the Transformer encoder connects each joint with all the other joints, while the GCNs capture information on physically connected relations. We evaluate the effectiveness of the proposed method on the Human3.6M dataset. Our model also shows better generalization ability when tested on the MPI-INF-3DHP dataset. Code can be retrieved at https://github.com/erikervalid/AMPose.
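To make the alternating design concrete, the following is a minimal NumPy sketch, not the authors' implementation (see the repository above for that). It assumes a commonly used 17-joint skeleton layout (the `PARENTS` list), single-head attention, random untrained weights, and it omits layer normalization and feed-forward sublayers. All function and parameter names are hypothetical and chosen for illustration only.

```python
import numpy as np

# Assumed 17-joint skeleton: PARENTS[j] is the parent of joint j (-1 marks the root).
# This layout is an illustrative assumption, not taken from the abstract.
PARENTS = [-1, 0, 1, 2, 0, 4, 5, 0, 7, 8, 9, 8, 11, 12, 8, 14, 15]


def normalized_adjacency(parents):
    """Symmetric normalized adjacency D^-1/2 (A + I) D^-1/2 of the skeleton graph."""
    n = len(parents)
    a = np.eye(n)  # self-loops
    for child, parent in enumerate(parents):
        if parent >= 0:
            a[child, parent] = a[parent, child] = 1.0  # bone = undirected edge
    d_inv_sqrt = 1.0 / np.sqrt(a.sum(axis=1))
    return a * d_inv_sqrt[:, None] * d_inv_sqrt[None, :]


def self_attention(x, wq, wk, wv):
    """Global relations: every joint attends to every other joint (single head)."""
    q, k, v = x @ wq, x @ wk, x @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerically stable softmax
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return x + weights @ v  # residual connection


def graph_conv(x, a_hat, w):
    """Physically connected relations: aggregate only over skeleton neighbours."""
    return x + np.maximum(a_hat @ x @ w, 0.0)  # ReLU plus residual connection


def init_params(rng, dim=32, depth=2):
    """Random (untrained) weights; dim and depth are arbitrary illustrative values."""
    def w(rows, cols):
        return rng.standard_normal((rows, cols)) / np.sqrt(rows)
    layers = [
        {"wq": w(dim, dim), "wk": w(dim, dim), "wv": w(dim, dim), "wg": w(dim, dim)}
        for _ in range(depth)
    ]
    return {"embed": w(2, dim), "layers": layers, "head": w(dim, 3)}


def alternating_forward(joints_2d, a_hat, params):
    """Lift 2D joints of shape (J, 2) to 3D of shape (J, 3) with alternating blocks."""
    x = joints_2d @ params["embed"]
    for layer in params["layers"]:
        x = self_attention(x, layer["wq"], layer["wk"], layer["wv"])  # global step
        x = graph_conv(x, a_hat, layer["wg"])  # skeleton-local step
    return x @ params["head"]


rng = np.random.default_rng(0)
A_HAT = normalized_adjacency(PARENTS)
pose_3d = alternating_forward(rng.standard_normal((17, 2)), A_HAT, init_params(rng))
print(pose_3d.shape)  # (17, 3)
```

The attention step mixes features across all joint pairs, whereas the graph convolution is restricted by `A_HAT` to joints that share a bone, so interleaving the two reintroduces skeletal structure after each global mixing step.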