Unlike 2D image datasets such as COCO, large-scale human datasets with 3D ground-truth annotations are very difficult to obtain in the wild. In this paper, we address this problem by augmenting existing 2D datasets with high-quality 3D pose fits. Remarkably, the resulting annotations are sufficient to train 3D pose regressor networks from scratch that outperform the current state of the art on in-the-wild benchmarks such as 3DPW. Additionally, training on our augmented data is straightforward, as it does not require mixing multiple incompatible 2D and 3D datasets or using complicated network architectures and training procedures. This simplified pipeline affords additional improvements, including injecting extreme crop augmentations to better reconstruct highly truncated people, and incorporating auxiliary inputs to improve 3D pose estimation accuracy. It also reduces the dependency on 3D datasets such as H36M that have restrictive licenses. We also use our method to introduce new benchmarks for the study of real-world challenges such as occlusions, truncations, and rare body poses. To obtain such high-quality 3D pseudo-annotations, we introduce Exemplar Fine-Tuning (EFT), inspired by progress in internal learning. EFT combines the re-projection accuracy of fitting methods such as SMPLify with the 3D pose prior implicitly captured by a pre-trained 3D pose regressor network. We show that EFT produces 3D annotations that result in better downstream performance and are qualitatively preferable in an extensive human-based assessment.
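The core idea behind EFT — initialize from a pretrained regressor's prediction, then refine it to minimize the 2D re-projection error while staying close to the network's output, which acts as an implicit pose prior — can be sketched with a toy NumPy example. All names, the orthographic projection, and the quadratic prior term below are illustrative assumptions, not the paper's actual SMPL-based pipeline:

```python
import numpy as np

def project(joints3d):
    """Toy orthographic projection: drop the depth coordinate."""
    return joints3d[:, :2]

def eft_refine(init_joints3d, keypoints2d, steps=200, lr=0.1, prior_w=0.01):
    """Per-exemplar refinement sketch.

    init_joints3d : (J, 3) initial 3D joints, standing in for a
                    pretrained regressor's prediction on this image.
    keypoints2d   : (J, 2) detected 2D keypoints to fit.
    The loop does gradient descent on
        0.5 * ||project(x) - keypoints2d||^2
        + 0.5 * prior_w * ||x - init_joints3d||^2,
    where the second term keeps the fit near the network output
    (a stand-in for the implicitly learned pose prior).
    """
    x = init_joints3d.copy()
    for _ in range(steps):
        grad = np.zeros_like(x)
        # gradient of the re-projection term (affects x, y only)
        grad[:, :2] = project(x) - keypoints2d
        # gradient of the prior term (pulls back toward the init)
        grad += prior_w * (x - init_joints3d)
        x -= lr * grad
    return x
```

With `prior_w = 0`, this collapses to pure 2D fitting (SMPLify-like); larger `prior_w` trusts the regressor more. In the actual method, the analogue of this loop is fine-tuning the regressor's weights on a single exemplar rather than optimizing joint positions directly.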