The state-of-the-art for monocular 3D human pose estimation in videos is dominated by the paradigm of 2D-to-3D pose uplifting. While the uplifting methods themselves are rather efficient, the true computational complexity depends on the per-frame 2D pose estimation. In this paper, we present a Transformer-based pose uplifting scheme that can operate on temporally sparse 2D pose sequences but still produce temporally dense 3D pose estimates. We show how masked token modeling can be utilized for temporal upsampling within Transformer blocks. This allows us to decouple the sampling rate of the input 2D poses from the target frame rate of the video and drastically decreases the total computational complexity. Additionally, we explore the option of pre-training on large motion capture archives, which has been largely neglected so far. We evaluate our method on two popular benchmark datasets: Human3.6M and MPI-INF-3DHP. With an MPJPE of 45.0 mm and 46.9 mm, respectively, our proposed method can compete with the state-of-the-art while reducing inference time by a factor of 12. This enables real-time throughput on a range of consumer hardware, in both stationary and mobile applications. We release our code and models at https://github.com/goldbricklemon/uplift-upsample-3dhpe
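To illustrate the core idea of masked token modeling for temporal upsampling, the following is a minimal PyTorch sketch, not the authors' implementation: a learnable mask token stands in for every frame that was skipped by the sparse 2D sampling, the Transformer blocks attend jointly over observed and masked tokens, and the regression head produces a 3D pose for every frame. All names and hyperparameters (SparseUpliftUpsample, seq_len, stride, etc.) are illustrative assumptions.

```python
# Hypothetical sketch of masked-token temporal upsampling for 2D-to-3D
# pose uplifting. Assumes PyTorch; names and defaults are illustrative.
import torch
import torch.nn as nn

class SparseUpliftUpsample(nn.Module):
    def __init__(self, num_joints=17, dim=256, depth=4, heads=8,
                 seq_len=81, stride=4):
        super().__init__()
        self.seq_len, self.stride = seq_len, stride
        # Embed each 2D pose (J x 2) into one token per sampled frame.
        self.pose_embed = nn.Linear(num_joints * 2, dim)
        # Learnable mask token that stands in for unsampled frames.
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.temporal_pos = nn.Parameter(torch.zeros(1, seq_len, dim))
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads,
                                           batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        # Regress a 3D pose (J x 3) from every token, masked or not.
        self.head = nn.Linear(dim, num_joints * 3)

    def forward(self, sparse_poses_2d):
        # sparse_poses_2d: (B, ceil(seq_len / stride), J, 2),
        # i.e. 2D poses only at every `stride`-th video frame.
        B = sparse_poses_2d.shape[0]
        # Start from a dense sequence of mask tokens ...
        tokens = self.mask_token.expand(B, self.seq_len, -1).clone()
        # ... and overwrite the observed positions with embedded 2D poses.
        observed = torch.arange(0, self.seq_len, self.stride)
        tokens[:, observed] = self.pose_embed(sparse_poses_2d.flatten(2))
        tokens = self.blocks(tokens + self.temporal_pos)
        # Dense 3D output: one pose per video frame, including the
        # frames for which no 2D pose was ever estimated.
        return self.head(tokens).view(B, self.seq_len, -1, 3)

# Usage: 21 sparse 2D poses in, 81 dense 3D poses out.
model = SparseUpliftUpsample()
poses_2d = torch.randn(2, 21, 17, 2)
poses_3d = model(poses_2d)  # (2, 81, 17, 3)
```

Under this reading, the decoupling claimed in the abstract follows directly: the expensive per-frame 2D pose estimator only runs on every `stride`-th frame, while the mask tokens let the Transformer still emit estimates at the full target frame rate.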