Estimating human pose from a monocular camera is an emerging research topic in the computer vision community, with many applications. Recently, driven by deep learning, a significant body of research has greatly advanced monocular human pose estimation in both 2D and 3D. Although several works have summarized the different approaches, it remains challenging for researchers to gain an in-depth view of how these approaches work. In this paper, we provide a comprehensive and holistic 2D-to-3D perspective to tackle this problem. We categorize the mainstream and milestone approaches since 2014 under unified frameworks. By systematically summarizing the differences and connections between these approaches, we further analyze the solutions for challenging cases, such as the lack of data, the inherent ambiguity between 2D and 3D, and complex multi-person scenarios. We also summarize the pose representation styles, benchmarks, evaluation metrics, and quantitative performance of popular approaches. Finally, we discuss the open challenges and reflect on promising directions for future research. We believe this survey will give readers a deep and insightful understanding of monocular human pose estimation.