Supervised approaches to 3D pose estimation from single images are remarkably effective when labeled data is abundant. Labeled data is often scarce, however, so much of the recent attention has shifted towards semi- and/or weakly supervised learning. Generating an effective form of supervision from few annotations still poses major challenges in crowded scenes. Since it is easy to observe a scene from multiple cameras, we propose to impose multi-view geometric constraints by means of a differentiable triangulation and to use it as a form of self-supervision during training when no labels are available. We therefore train a 2D pose estimator so that its predictions correspond to the re-projection of the triangulated 3D pose, and we train an auxiliary network on the triangulated poses to produce the final 3D poses. We complement the triangulation with a weighting mechanism that nullifies the impact of noisy predictions caused by self-occlusion or occlusion by other subjects. Our experimental results on Human3.6M and MPI-INF-3DHP substantiate the value of our weighting strategy, with which we obtain state-of-the-art results in the semi- and weakly supervised learning setups. We also contribute a new multi-player sports dataset that features occlusion, and show the effectiveness of our algorithm over baseline triangulation methods.
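To make the weighted triangulation and the re-projection constraint concrete, the following is a minimal sketch for a single joint. It assumes a standard weighted direct linear transform (DLT) solved by SVD; the abstract does not specify the exact formulation, so the function names `triangulate_weighted`, `project`, and `reprojection_loss`, the camera setup, and the hand-set weights are all illustrative assumptions. NumPy is used only to show the geometry; in training, the same operations would be written in an automatic-differentiation framework so that gradients reach the 2D estimator.

```python
import numpy as np

def triangulate_weighted(proj_mats, points_2d, weights):
    """Weighted DLT triangulation of one joint seen in V views (assumed formulation).

    proj_mats: (V, 3, 4) camera projection matrices
    points_2d: (V, 2) detected 2D joint locations in pixels
    weights:   (V,) per-view confidences; 0 removes a view entirely
    """
    proj_mats = np.asarray(proj_mats, dtype=float)
    points_2d = np.asarray(points_2d, dtype=float)
    weights = np.asarray(weights, dtype=float)
    # Each view contributes two linear constraints on the homogeneous point X:
    #   (u * P[2] - P[0]) @ X = 0   and   (v * P[2] - P[1]) @ X = 0
    rows_u = points_2d[:, 0:1] * proj_mats[:, 2, :] - proj_mats[:, 0, :]
    rows_v = points_2d[:, 1:2] * proj_mats[:, 2, :] - proj_mats[:, 1, :]
    # Scaling a view's rows by its weight controls its influence on the solution.
    A = np.concatenate([weights[:, None] * rows_u,
                        weights[:, None] * rows_v], axis=0)
    # The least-squares solution is the right singular vector associated
    # with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X = vt[-1]
    return X[:3] / X[3]

def project(proj_mat, point_3d):
    """Project a 3D point into one view and return pixel coordinates."""
    x = proj_mat @ np.append(point_3d, 1.0)
    return x[:2] / x[2]

def reprojection_loss(proj_mats, points_2d, weights, point_3d):
    """Weighted mean squared distance between 2D predictions and re-projections."""
    reproj = np.stack([project(P, point_3d) for P in proj_mats])
    sq_dist = np.sum((reproj - np.asarray(points_2d, dtype=float)) ** 2, axis=1)
    w = np.asarray(weights, dtype=float)
    return float(np.sum(w * sq_dist) / np.sum(w))

# Hypothetical setup: four cameras sharing intrinsics K, shifted along x or y.
K = np.array([[1000.0, 0.0, 500.0],
              [0.0, 1000.0, 500.0],
              [0.0, 0.0, 1.0]])
translations = [(0.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.0, -1.0, 0.0), (1.0, 0.0, 0.0)]
proj_mats = np.stack([K @ np.hstack([np.eye(3), np.array(t)[:, None]])
                      for t in translations])

true_joint = np.array([0.2, -0.1, 5.0])
points_2d = np.stack([project(P, true_joint) for P in proj_mats])
points_2d[3] += 80.0  # simulate an occluded joint: a gross 2D error in the last view

uniform = triangulate_weighted(proj_mats, points_2d, np.ones(4))
weighted = triangulate_weighted(proj_mats, points_2d, np.array([1.0, 1.0, 1.0, 0.0]))
print("error with uniform weights:", np.linalg.norm(uniform - true_joint))
print("error with occluded view down-weighted:", np.linalg.norm(weighted - true_joint))
```

In this toy setup the weights are set by hand to suppress the corrupted view; in the method described above they would come from the weighting mechanism, whose details are not given in the abstract.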