Transformer architectures have become the model of choice in natural language processing and are now being introduced into computer vision tasks such as image classification, object detection, and semantic segmentation. In the field of human pose estimation, however, convolutional architectures remain dominant. In this work, we present PoseFormer, a purely transformer-based approach for 3D human pose estimation in videos that involves no convolutional architectures. Inspired by recent developments in vision transformers, we design a spatial-temporal transformer structure that comprehensively models the relations among human joints within each frame as well as the temporal correlations across frames, then outputs an accurate 3D human pose for the center frame. We evaluate our method quantitatively and qualitatively on two popular, standard benchmark datasets: Human3.6M and MPI-INF-3DHP. Extensive experiments show that PoseFormer achieves state-of-the-art performance on both datasets. Code is available at \url{https://github.com/zczcwh/PoseFormer}.
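To make the spatial-temporal design concrete, the following is a minimal PyTorch sketch of the two-stage idea the abstract describes: a spatial transformer treats each joint in a frame as a token to model joint relations, a temporal transformer treats each frame's joint features as a token to model correlations across frames, and a linear head regresses the 3D pose of the center frame. The class name, embedding sizes, depths, and the mean-pooling over frames are illustrative assumptions, not the authors' released implementation (see the linked repository for that).

```python
import torch
import torch.nn as nn

class SpatialTemporalTransformer(nn.Module):
    """Illustrative sketch of a spatial-temporal transformer for 2D-to-3D pose lifting."""

    def __init__(self, num_joints=17, in_dim=2, embed_dim=32,
                 num_frames=81, depth=4, num_heads=8):
        super().__init__()
        # Spatial stream: each 2D joint in a frame becomes one token.
        self.joint_embed = nn.Linear(in_dim, embed_dim)
        self.spatial_pos = nn.Parameter(torch.zeros(1, num_joints, embed_dim))
        spatial_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True)
        self.spatial_encoder = nn.TransformerEncoder(spatial_layer, depth)

        # Temporal stream: each frame's flattened joint features become one token.
        frame_dim = embed_dim * num_joints
        self.temporal_pos = nn.Parameter(torch.zeros(1, num_frames, frame_dim))
        temporal_layer = nn.TransformerEncoderLayer(
            d_model=frame_dim, nhead=num_heads, batch_first=True)
        self.temporal_encoder = nn.TransformerEncoder(temporal_layer, depth)

        # Regression head: 3D coordinates for every joint of the center frame.
        self.head = nn.Linear(frame_dim, num_joints * 3)

    def forward(self, x):
        # x: (batch, frames, joints, 2) -- a sequence of 2D keypoint detections.
        b, f, j, c = x.shape
        tokens = self.joint_embed(x.reshape(b * f, j, c)) + self.spatial_pos
        tokens = self.spatial_encoder(tokens)       # joint relations within each frame
        frames = tokens.reshape(b, f, -1) + self.temporal_pos
        frames = self.temporal_encoder(frames)      # temporal correlations across frames
        # Mean-pooling over frames is a simplification; the paper regresses
        # the pose of the center frame from the temporal output.
        pooled = frames.mean(dim=1)
        return self.head(pooled).reshape(b, j, 3)

model = SpatialTemporalTransformer()
pose3d = model(torch.randn(2, 81, 17, 2))  # -> (2, 17, 3)
```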