While monocular 3D pose estimation appears to achieve highly accurate results on public datasets, the generalization ability of these methods is largely overlooked. In this work, we perform a systematic evaluation of existing methods and find that they incur notably larger errors when tested on different cameras, human poses, and appearances. To address the problem, we introduce VirtualPose, a two-stage learning framework that exploits the hidden "free lunch" specific to this task, i.e., generating an infinite number of poses and cameras for training models at no cost. To that end, the first stage transforms images into abstract geometry representations (AGRs), and the second maps them to 3D poses. This design addresses the generalization issue from two aspects: (1) the first stage can be trained on diverse 2D datasets to reduce the risk of over-fitting to limited appearance; (2) the second stage can be trained on diverse AGRs synthesized from a large number of virtual cameras and poses. VirtualPose outperforms the SOTA methods without using any paired images and 3D poses from the benchmarks, which paves the way for practical applications. Code is available at https://github.com/wkom/VirtualPose.
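As a rough sketch of this two-stage design (not the authors' implementation; the module names, the choice of per-joint 2D heatmaps as the AGR, and all shapes below are illustrative assumptions), the pipeline and the synthetic training signal for the second stage might look like:

```python
import torch
import torch.nn as nn

class AGREstimator(nn.Module):
    """Stage 1 (sketch): image -> abstract geometry representation (AGR),
    assumed here to be per-joint 2D heatmaps. It can be trained on diverse
    2D datasets, reducing over-fitting to the benchmarks' limited appearance."""
    def __init__(self, num_joints: int = 15):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(64, num_joints, 1)

    def forward(self, image: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, H, W) -> heatmaps: (B, J, H/4, W/4)
        return self.head(self.backbone(image))

class PoseLifter(nn.Module):
    """Stage 2 (sketch): AGR -> 3D pose. Its input carries no appearance,
    so it can be trained entirely on synthesized AGRs."""
    def __init__(self, num_joints: int = 15, feat: int = 64):
        super().__init__()
        self.num_joints = num_joints
        self.net = nn.Sequential(
            nn.Conv2d(num_joints, feat, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(feat, num_joints * 3),
        )

    def forward(self, agr: torch.Tensor) -> torch.Tensor:
        return self.net(agr).view(-1, self.num_joints, 3)

def synthesize_agr(pose_cam: torch.Tensor, f: float, cx: float, cy: float,
                   size: int = 64, sigma: float = 2.0) -> torch.Tensor:
    """Render Gaussian heatmaps for a 3D pose seen by a virtual pinhole
    camera; (pose, camera) pairs like this cost nothing to generate.
    pose_cam: (J, 3) joints in camera coordinates, z > 0."""
    u = f * pose_cam[:, 0] / pose_cam[:, 2] + cx      # (J,) pixel x
    v = f * pose_cam[:, 1] / pose_cam[:, 2] + cy      # (J,) pixel y
    ys, xs = torch.meshgrid(torch.arange(size, dtype=torch.float32),
                            torch.arange(size, dtype=torch.float32),
                            indexing="ij")
    d2 = (xs - u[:, None, None]) ** 2 + (ys - v[:, None, None]) ** 2
    return torch.exp(-d2 / (2 * sigma ** 2))          # (J, size, size)

# Stage-2 training step on purely synthetic data (hypothetical values):
pose = torch.rand(15, 3) * 2 - 1                      # random virtual pose
pose[:, 2] += 4.0                                     # place in front of camera
agr = synthesize_agr(pose, f=60.0, cx=32.0, cy=32.0)  # (15, 64, 64)
pred = PoseLifter()(agr.unsqueeze(0))                 # (1, 15, 3)
loss = nn.functional.mse_loss(pred, pose.unsqueeze(0))
```

Because the lifter only ever consumes appearance-free AGRs, arbitrarily many virtual camera and pose combinations can be sampled for its training, which is the "free lunch" the abstract refers to.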