Supervised approaches to 3D pose estimation from single images are remarkably effective when labeled data is abundant. However, because acquiring ground-truth 3D labels is labor intensive and time consuming, recent attention has shifted towards semi- and weakly-supervised learning. Generating an effective form of supervision from few annotations still poses a major challenge, particularly in crowded scenes. In this paper, we propose to impose multi-view geometric constraints by means of a weighted differentiable triangulation and to use it as a form of self-supervision when no labels are available. We therefore train a 2D pose estimator in such a way that its predictions correspond to the re-projection of the triangulated 3D pose, and we train an auxiliary network on these predictions to produce the final 3D poses. We complement the triangulation with a weighting mechanism that alleviates the impact of noisy predictions caused by self-occlusion or by occlusion from other subjects. We demonstrate the effectiveness of our semi-supervised approach on the Human3.6M and MPI-INF-3DHP datasets, as well as on a new multi-view multi-person dataset that features occlusion.
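To make the idea of a weighted triangulation concrete, the following is a minimal NumPy sketch of one common formulation: a weighted linear (DLT) triangulation of a single joint solved by SVD. It is an illustration under stated assumptions, not the implementation used in the paper. The function names `weighted_triangulate` and `project`, the use of DLT, and the per-view scalar weights passed in as plain inputs are all assumptions made here; in the approach described above, the weights would come from the weighting mechanism and the whole computation would run inside an automatic-differentiation framework so that gradients reach the 2D pose estimator.

```python
import numpy as np


def weighted_triangulate(proj_mats, points_2d, weights):
    """Weighted linear (DLT) triangulation of one joint from V views.

    proj_mats: (V, 3, 4) camera projection matrices (assumed known).
    points_2d: (V, 2) 2D detections of the joint, one per view.
    weights:   (V,) non-negative per-view confidences (hypothetical input).
    Returns the triangulated 3D point as an array of shape (3,).
    """
    proj_mats = np.asarray(proj_mats, dtype=float)
    points_2d = np.asarray(points_2d, dtype=float)
    weights = np.asarray(weights, dtype=float)

    # Each view contributes two linear constraints on the homogeneous point X:
    #   (u * P[2] - P[0]) @ X = 0   and   (v * P[2] - P[1]) @ X = 0
    rows_u = points_2d[:, 0:1] * proj_mats[:, 2, :] - proj_mats[:, 0, :]
    rows_v = points_2d[:, 1:2] * proj_mats[:, 2, :] - proj_mats[:, 1, :]

    # Scaling a view's rows by its weight reduces (or removes, at weight 0)
    # its influence on the least-squares solution.
    w = weights[:, None]
    A = np.concatenate([w * rows_u, w * rows_v], axis=0)  # shape (2V, 4)

    # The solution is the right singular vector with the smallest singular value.
    _, _, vt = np.linalg.svd(A)
    X_h = vt[-1]
    return X_h[:3] / X_h[3]


def project(P, X):
    """Project a 3D point X (shape (3,)) with a 3x4 projection matrix P."""
    x = P @ np.append(X, 1.0)
    return x[:2] / x[2]
```

In a self-supervised setting of the kind the abstract describes, the triangulated point would be re-projected into each view with `project`, and the discrepancy between these re-projections and the 2D predictions would serve as the training signal. Down-weighting a view whose detection is occluded keeps that detection from corrupting the triangulated point.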