In 3D multi-person pose estimation from monocular video, inter-person occlusion and close interactions can cause human detection to fail and make the grouping of human joints unreliable. Existing top-down methods rely on human detection and thus suffer from these problems. Existing bottom-up methods do not use human detection, but they process all persons at once at the same scale, which makes them sensitive to scale variations among persons. To address these challenges, we propose integrating the top-down and bottom-up approaches to exploit the strengths of both. Our top-down network estimates the joints of all persons in an image patch rather than of a single person, making it robust to erroneous bounding boxes. Our bottom-up network incorporates normalized heatmaps based on human detection, making it more robust to scale variations. Finally, the 3D poses estimated by the top-down and bottom-up networks are fed into our integration network to produce the final 3D poses. Beyond this integration, we propose a two-person pose discriminator that enforces natural two-person interactions; existing pose discriminators are designed solely for a single person and consequently cannot assess whether inter-person interactions are natural. Lastly, we apply a semi-supervised method to overcome the scarcity of 3D ground-truth data. Our quantitative and qualitative evaluations show the effectiveness of our method compared to state-of-the-art baselines.
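As a rough illustration of the final fusion step, the sketch below combines two 3D pose estimates of the same person using a per-joint confidence-weighted average. It is only a hand-written stand-in: the abstract describes a learned integration network, and the function name `fuse_poses`, the per-joint confidence inputs, and the weighting rule are all assumptions made for illustration, not the authors' method.

```python
from typing import List, Sequence, Tuple

Joint = Tuple[float, float, float]  # (x, y, z) position of one joint


def fuse_poses(
    top_down: Sequence[Joint],
    bottom_up: Sequence[Joint],
    conf_td: Sequence[float],
    conf_bu: Sequence[float],
) -> List[Joint]:
    """Fuse two 3D poses joint by joint (hypothetical helper).

    Assumption: each branch supplies a non-negative confidence per joint.
    A fixed weighted average replaces the learned integration network.
    """
    if not (len(top_down) == len(bottom_up) == len(conf_td) == len(conf_bu)):
        raise ValueError("all inputs must have one entry per joint")

    fused: List[Joint] = []
    for p_td, p_bu, c_td, c_bu in zip(top_down, bottom_up, conf_td, conf_bu):
        total = c_td + c_bu
        # Fall back to an equal split when neither branch is confident.
        w = 0.5 if total <= 0 else c_td / total
        x, y, z = (w * a + (1.0 - w) * b for a, b in zip(p_td, p_bu))
        fused.append((x, y, z))
    return fused


# Two-joint toy pose: the first joint is trusted equally in both branches,
# the second only in the top-down branch.
pose_td = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
pose_bu = [(2.0, 2.0, 2.0), (5.0, 5.0, 5.0)]
fused_pose = fuse_poses(pose_td, pose_bu, [1.0, 1.0], [1.0, 0.0])
```

A learned network can weigh the two branches using context that a fixed rule cannot see, such as occlusion patterns or person scale, which is presumably why the paper trains an integration network instead of averaging.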