Human pose estimation from single images is a challenging problem that is typically solved by supervised learning. Unfortunately, labeled training data does not yet exist for many human activities, since 3D annotation requires dedicated motion capture systems. Therefore, we propose an unsupervised approach that learns to predict a 3D human pose from a single image while being trained only with 2D pose data, which can be crowd-sourced and is already widely available. To this end, we estimate the 3D pose that is most likely over random projections, with the likelihood estimated using normalizing flows on 2D poses. While previous work requires strong priors on camera rotations in the training dataset, we learn the distribution of camera angles, which significantly improves performance. Another part of our contribution is to stabilize training with normalizing flows on high-dimensional 3D pose data by first projecting the 2D poses to a linear subspace. We outperform the state-of-the-art unsupervised human pose estimation methods on many metrics on the benchmark datasets Human3.6M and MPI-INF-3DHP.
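The core objective described above can be sketched in a few lines: sample random camera rotations, project a candidate 3D pose to 2D under each, and average the 2D log-likelihoods. This is a minimal illustration, not the paper's implementation; the function names are invented here, and a standard Gaussian log-density stands in for the learned normalizing flow on (subspace-projected) 2D poses.

```python
import numpy as np

def random_rotation(rng):
    # Sample a uniformly random 3x3 rotation via QR decomposition
    # of a Gaussian matrix.
    q, r = np.linalg.qr(rng.standard_normal((3, 3)))
    q *= np.sign(np.diag(r))   # fix column signs to make QR unique
    if np.linalg.det(q) < 0:   # ensure a proper rotation (det = +1)
        q[:, 0] *= -1
    return q

def project(pose_3d, rotation):
    # Orthographic projection: rotate the joints, then drop the depth axis.
    # pose_3d: (num_joints, 3) array of joint positions.
    return (pose_3d @ rotation.T)[:, :2]

def log_likelihood_2d(pose_2d):
    # Placeholder for the trained normalizing-flow density on 2D poses:
    # here simply a standard Gaussian log-density, for illustration only.
    return -0.5 * np.sum(pose_2d ** 2)

def score_pose(pose_3d, n_projections=64, seed=0):
    # Average 2D log-likelihood over random camera rotations -- the
    # quantity a predicted 3D pose should make large under this objective.
    rng = np.random.default_rng(seed)
    scores = [log_likelihood_2d(project(pose_3d, random_rotation(rng)))
              for _ in range(n_projections)]
    return float(np.mean(scores))
```

Under the Gaussian stand-in, poses whose random 2D projections are more probable score higher; in the actual method the flow is trained on real 2D pose data, and the rotations are drawn from a learned camera-angle distribution rather than uniformly.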