HSTFormer: Hierarchical Spatial-Temporal Transformers for 3D Human Pose Estimation

Transformer-based approaches have been successfully applied to 3D human pose estimation (HPE) from 2D pose sequences and have achieved state-of-the-art (SOTA) performance. However, current SOTA methods have difficulty modeling the spatial-temporal correlations of joints at different levels simultaneously. This difficulty stems from the spatial-temporal complexity of poses: temporally, poses move at various speeds; spatially, the movement involves various joints and body parts. Hence, a cookie-cutter transformer is not adaptable and can hardly meet the "in-the-wild" requirement. To mitigate this issue, we propose Hierarchical Spatial-Temporal transFormers (HSTFormer), which capture multi-level spatial-temporal correlations of joints gradually, from local to global, for accurate 3D HPE. HSTFormer consists of four transformer encoders (TEs) and a fusion module. To the best of our knowledge, HSTFormer is the first to study hierarchical TEs with multi-level fusion. Extensive experiments on three datasets (Human3.6M, MPI-INF-3DHP, and HumanEva) demonstrate that HSTFormer achieves competitive and consistent performance on benchmarks of various scales and difficulties. Specifically, it surpasses recent SOTA methods on the challenging MPI-INF-3DHP dataset and the small-scale HumanEva dataset, with a highly generalized systematic approach. The code is available at: https://github.com/qianxiaoye825/HSTFormer.
