We propose integrating top-down and bottom-up approaches to exploit the strengths of both. Our top-down network estimates the joints of all persons in an image patch rather than of a single person, making it robust to erroneous bounding boxes. Our bottom-up network incorporates human-detection-based normalized heatmaps, making it more robust to scale variations. The 3D poses estimated by the top-down and bottom-up networks are then fed into our integration network, which produces the final 3D poses. To address the common gap between training and testing data, we perform test-time optimization, refining the estimated 3D human poses with a high-order temporal constraint, a re-projection loss, and bone-length regularization. We also introduce a two-person pose discriminator that enforces natural two-person interactions. Finally, we apply a semi-supervised method to overcome the scarcity of 3D ground-truth data.