Estimating 3D human poses from video is a challenging problem. The lack of 3D human pose annotations is a major obstacle to supervised training and to generalization to unseen datasets. In this work, we address this problem by proposing a weakly-supervised training scheme that requires neither 3D annotations nor calibrated cameras. The proposed method relies on temporal information and triangulation. Using 2D poses from multiple views as input, we first estimate the relative camera orientations and then generate 3D poses via triangulation. Triangulation is applied only to views with high 2D human joint confidence. The generated 3D poses are then used to train a recurrent lifting network (RLN) that estimates 3D poses from 2D poses. We further apply a multi-view re-projection loss to the estimated 3D poses and enforce consistency among the 3D poses estimated from different views. Our method therefore relaxes the constraints of previous approaches: only multi-view videos are required for training, which makes it convenient for in-the-wild settings. At inference, RLN requires only single-view videos. The proposed method outperforms previous works on two challenging datasets, Human3.6M and MPI-INF-3DHP. Code and pretrained models will be made publicly available.
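The confidence-filtered triangulation step described above can be sketched with standard direct linear transform (DLT) triangulation. This is a minimal illustration, not the authors' exact implementation; the function name, the confidence threshold value, and the use of NumPy's SVD are assumptions for the sake of the example.

```python
import numpy as np

def triangulate_joint(proj_mats, points_2d, confidences, conf_thresh=0.5):
    """Triangulate one 3D joint from multiple views via DLT.

    Only views whose 2D joint confidence exceeds conf_thresh contribute,
    mirroring the confidence filtering described in the abstract.
    proj_mats: list of 3x4 camera projection matrices.
    points_2d: list of (x, y) 2D joint detections, one per view.
    confidences: per-view detection confidence in [0, 1].
    """
    rows = []
    for P, (x, y), c in zip(proj_mats, points_2d, confidences):
        if c < conf_thresh:
            continue  # skip low-confidence views
        # Each reliable view contributes two linear constraints on X.
        rows.append(x * P[2] - P[0])
        rows.append(y * P[2] - P[1])
    A = np.stack(rows)
    # Homogeneous least squares: right singular vector for smallest sigma.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]  # dehomogenize to a 3D point
```

In the proposed pipeline, such triangulated joints serve as pseudo ground truth for training the lifting network, so filtering out low-confidence views keeps noisy detections from corrupting the supervision signal.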