Recent transformer-based solutions have been introduced to estimate 3D human pose from 2D keypoint sequences by considering body joints across all frames globally to learn spatio-temporal correlation. We observe that the motions of different joints differ significantly. However, previous methods cannot efficiently model the inter-frame correspondence of each individual joint, leading to insufficient learning of spatio-temporal correlation. We propose MixSTE (Mixed Spatio-Temporal Encoder), which has a temporal transformer block to separately model the temporal motion of each joint and a spatial transformer block to learn inter-joint spatial correlation. These two blocks are utilized alternately to obtain better spatio-temporal feature encoding. In addition, the network output is extended from the central frame to all frames of the input video, thereby improving the coherence between the input and output sequences. Extensive experiments are conducted on three benchmarks (Human3.6M, MPI-INF-3DHP, and HumanEva). The results show that our model outperforms the state-of-the-art approach by 10.9% P-MPJPE and 7.6% MPJPE. The code is available at https://github.com/JinluZhang1126/MixSTE.
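To make the alternating spatial-temporal design concrete, below is a minimal PyTorch sketch of the idea described in the abstract: a spatial block attends across joints within each frame, and a temporal block attends across frames for each joint separately, with the two applied alternately and the network regressing 3D poses for the entire input sequence. All names (MixSTEBlock, dim, the 17-joint/27-frame shapes) are illustrative assumptions, not the authors' implementation.

```python
# Sketch of alternating spatial and temporal transformer blocks.
# Hypothetical module; not the official MixSTE code.
import torch
import torch.nn as nn

class MixSTEBlock(nn.Module):
    """One spatial transformer block followed by one temporal transformer block."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        # Spatial block: attention across the J joints within each frame.
        self.spatial = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)
        # Temporal block: attention across the T frames, per joint.
        self.temporal = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, J, C) -- batch, frames, joints, channels.
        B, T, J, C = x.shape
        # Inter-joint spatial correlation: joints attend within a frame.
        x = self.spatial(x.reshape(B * T, J, C)).reshape(B, T, J, C)
        # Per-joint temporal motion: each joint's trajectory attends over frames.
        x = x.permute(0, 2, 1, 3).reshape(B * J, T, C)
        x = self.temporal(x).reshape(B, J, T, C).permute(0, 2, 1, 3)
        return x

# Seq2seq usage: the network outputs 3D poses for all T input frames,
# not just the central one.
blocks = nn.Sequential(*[MixSTEBlock(dim=64) for _ in range(4)])
embed = nn.Linear(2, 64)          # lift 2D keypoints to the model dimension
head = nn.Linear(64, 3)           # regress a 3D position per joint per frame
x2d = torch.randn(1, 27, 17, 2)   # (B, T=27 frames, J=17 joints, xy)
x3d = head(blocks(embed(x2d)))    # (1, 27, 17, 3): poses for the whole clip
print(x3d.shape)
```

The key design choice this sketch mirrors is factorizing attention: the temporal block sees each joint as an independent sequence over frames, so joints with very different motion patterns are modeled separately, while the spatial block restores inter-joint structure within each frame.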