Human pose estimation from single images is a challenging problem in computer vision that requires large amounts of labeled training data to be solved accurately. Unfortunately, for many human activities (\eg outdoor sports) such training data does not exist and is hard or even impossible to acquire with traditional motion capture systems. We propose a self-supervised approach that learns a single-image 3D pose estimator from unlabeled multi-view data. To this end, we exploit multi-view consistency constraints to disentangle the observed 2D pose into the underlying 3D pose and camera rotation. In contrast to most existing methods, we do not require calibrated cameras and can therefore learn from moving cameras. Nevertheless, in the case of a static camera setup, we present an optional extension that incorporates the constant relative camera rotations between views into our framework. Key to the success of our approach are new, unbiased reconstruction objectives that mix information across views and training samples. The proposed approach is evaluated on two benchmark datasets (Human3.6M and MPI-INF-3DHP) and on the in-the-wild SkiPose dataset.
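To make the disentangling idea concrete, the following is a minimal, hypothetical PyTorch sketch, not the authors' implementation: a lifting network maps each view's 2D pose to a canonical 3D pose plus a camera rotation, and a cross-view loss reprojects the pose estimated in one view through the rotation estimated in the other. All names (`LiftingNet`, `rot6d_to_matrix`, `project`, `cross_view_loss`), the joint count, the 6D rotation parameterization, and the orthographic projection are assumptions made for illustration only.

```python
# Illustrative sketch of the cross-view consistency objective described in the
# abstract. All names and design choices here are assumptions, not the paper's
# actual implementation.
import torch
import torch.nn as nn

J = 17  # number of body joints (assumption)

def rot6d_to_matrix(x):
    """Gram-Schmidt on a 6D rotation representation -> (B, 3, 3) matrix."""
    a1, a2 = x[:, :3], x[:, 3:]
    b1 = nn.functional.normalize(a1, dim=1)
    b2 = nn.functional.normalize(a2 - (b1 * a2).sum(1, keepdim=True) * b1, dim=1)
    b3 = torch.cross(b1, b2, dim=1)
    return torch.stack([b1, b2, b3], dim=1)

def project(pose3d, R):
    """Rotate the canonical pose into a camera frame and drop depth
    (orthographic projection, assumed here for simplicity)."""
    return torch.einsum('bij,bkj->bki', R, pose3d)[..., :2]

class LiftingNet(nn.Module):
    """Disentangles a 2D pose into a canonical 3D pose and a camera rotation."""
    def __init__(self, joints=J, hidden=1024):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(2 * joints, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.pose_head = nn.Linear(hidden, 3 * joints)  # canonical 3D pose
        self.rot_head = nn.Linear(hidden, 6)            # 6D rotation params

    def forward(self, pose2d):                  # pose2d: (B, J, 2)
        h = self.backbone(pose2d.flatten(1))
        pose3d = self.pose_head(h).view(-1, J, 3)
        R = rot6d_to_matrix(self.rot_head(h))   # (B, 3, 3)
        return pose3d, R

def cross_view_loss(net, view_a, view_b):
    """Mix information across two synchronized views: the canonical pose
    estimated from one view, rotated by the other view's camera, must
    reproject onto the other view's observed 2D pose."""
    pose_a, R_a = net(view_a)
    pose_b, R_b = net(view_b)
    return (project(pose_b, R_a) - view_a).abs().mean() \
         + (project(pose_a, R_b) - view_b).abs().mean()
```

Because the same canonical pose must explain every view once rotated by that view's camera, the network cannot satisfy the swapped reprojection terms without separating pose from rotation; note that no camera calibration enters the loss.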