Recently, transformer-based methods have achieved significant success in sequential 2D-to-3D lifting for human pose estimation. As a pioneering work, PoseFormer captures the spatial relations of human joints within each video frame and human dynamics across frames with cascaded transformer layers, and has achieved impressive performance. However, in real-world scenarios, the performance of PoseFormer and its follow-ups is limited by two factors: (a) the length of the input joint sequence, and (b) the quality of the 2D joint detections. Existing methods typically apply self-attention to all frames of the input sequence, incurring a heavy computational burden as the number of frames grows in pursuit of higher estimation accuracy, and they are not robust to the noise naturally introduced by the limited capability of 2D joint detectors. In this paper, we propose PoseFormerV2, which exploits a compact frequency-domain representation of lengthy skeleton sequences to efficiently scale up the receptive field and boost robustness to noisy 2D joint detections. With minimal modifications to PoseFormer, the proposed method effectively fuses features in both the time and frequency domains, enjoying a better speed-accuracy trade-off than its precursor. Extensive experiments on two benchmark datasets (i.e., Human3.6M and MPI-INF-3DHP) demonstrate that the proposed approach significantly outperforms the original PoseFormer and other transformer-based variants. Code is released at \url{https://github.com/QitaoZhao/PoseFormerV2}.
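To make the core idea concrete, the following is a minimal sketch (ours, not the authors' released code) of such a frequency-domain compaction, assuming a discrete cosine transform (DCT) over the time axis; the sequence length, joint count, and number of retained coefficients are illustrative choices, not values from the paper:

\begin{verbatim}
# Hedged sketch: compact a long 2D joint sequence by keeping only its
# low-frequency DCT coefficients (illustrative, not the authors' code).
import numpy as np
from scipy.fft import dct, idct

T, J, C = 81, 17, 2             # frames, joints, (x, y); hypothetical sizes
seq = np.random.randn(T, J, C)  # stand-in for a detected 2D joint sequence

k = 9  # number of low-frequency coefficients to keep (hypothetical)
# DCT along the time axis; the first k coefficients summarize the
# slow-varying motion of the whole T-frame sequence.
coeffs = dct(seq, type=2, norm='ortho', axis=0)[:k]   # shape (k, J, C)

# Reconstructing from only these coefficients yields a low-pass version
# of the sequence, discarding high-frequency detection jitter.
padded = np.zeros_like(seq)
padded[:k] = coeffs
recon = idct(padded, type=2, norm='ortho', axis=0)    # shape (T, J, C)
\end{verbatim}

Retaining only low-frequency coefficients both shortens the token sequence fed to self-attention (reducing compute for long inputs) and suppresses high-frequency noise from the 2D detector, which is the intuition behind the speed-accuracy and robustness gains described above.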