Monocular 3D human pose estimation has made progress in recent years. Most methods focus on a single person and estimate the pose in person-centric coordinates, i.e., coordinates relative to the center of the target person. Hence, these methods are inapplicable to multi-person 3D pose estimation, where absolute coordinates (e.g., camera coordinates) are required. Moreover, multi-person pose estimation is more challenging than single-person pose estimation, due to inter-person occlusion and close human interactions. Existing top-down methods rely on human detection, and thus suffer from detection errors and cannot produce reliable pose estimates in multi-person scenes. Meanwhile, existing bottom-up methods do not use human detection and are therefore not affected by detection errors, but since they process all persons in a scene at once, they are prone to errors, particularly for persons at small scales. To address these challenges, we propose integrating the top-down and bottom-up approaches to exploit the strengths of both. Our top-down network estimates the joints of all persons in an image patch instead of only one, making it robust to possibly erroneous bounding boxes. Our bottom-up network incorporates human-detection-based normalized heatmaps, making it more robust to scale variations. Finally, the 3D poses estimated by the top-down and bottom-up networks are fed into our integration network, which produces the final 3D poses. To address the common gaps between training and testing data, we perform test-time optimization, refining the estimated 3D human poses with a high-order temporal constraint, a re-projection loss, and bone-length regularization. Our evaluations demonstrate the effectiveness of the proposed method. Code and models are available at https://github.com/3dpose/3D-Multi-Person-Pose.