Thanks to the development of 2D keypoint detectors, monocular 3D human pose estimation (HPE) via 2D-to-3D lifting approaches has achieved remarkable improvements. Still, monocular 3D HPE remains a challenging problem due to inherent depth ambiguity and occlusions. To handle this problem, many previous works exploit temporal information to mitigate these difficulties. However, in many real-world applications, frame sequences are not accessible. This paper focuses on reconstructing a 3D pose from a single 2D keypoint detection. Rather than exploiting temporal information, we alleviate the depth ambiguity by generating multiple 3D pose candidates that can all be mapped to the same 2D keypoints. We build a novel diffusion-based framework to effectively sample diverse 3D poses from the output of an off-the-shelf 2D detector. By replacing the conventional denoising U-Net with a graph convolutional network that captures the correlation between human joints, our approach achieves further performance improvements. We evaluate our method on the widely adopted Human3.6M and HumanEva-I datasets. Comprehensive experiments confirm the efficacy of the proposed method and show that our model outperforms state-of-the-art multi-hypothesis 3D HPE methods.
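To make the sampling idea concrete, the following is a minimal, hypothetical sketch of drawing multiple 3D pose hypotheses from a DDPM-style reverse process conditioned on one 2D detection. The `gcn_denoiser` here is only a placeholder standing in for the paper's trained graph-convolutional denoiser, and all names, constants, and the toy denoising rule are illustrative assumptions, not the actual method.

```python
import numpy as np

NUM_JOINTS = 17       # Human3.6M skeleton size
NUM_HYPOTHESES = 5    # number of 3D pose candidates per 2D detection
NUM_STEPS = 50        # diffusion timesteps (toy value)

rng = np.random.default_rng(0)

def gcn_denoiser(x_t, keypoints_2d, t):
    """Placeholder for the learned GCN denoiser: predicts the noise in the
    noisy 3D joints, conditioned on the 2D keypoints and the timestep.
    Toy behaviour only: pretends the clean pose lies at (x, y, 0)."""
    x0_guess = np.concatenate([keypoints_2d, np.zeros((NUM_JOINTS, 1))], axis=1)
    return x_t - x0_guess  # "predicted noise"

def sample_hypotheses(keypoints_2d):
    """Draw NUM_HYPOTHESES 3D poses for one 2D detection via a
    simplified DDPM reverse process (ancestral sampling)."""
    betas = np.linspace(1e-4, 0.02, NUM_STEPS)
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)

    poses = []
    for _ in range(NUM_HYPOTHESES):
        x_t = rng.standard_normal((NUM_JOINTS, 3))  # start from pure noise
        for t in reversed(range(NUM_STEPS)):
            eps = gcn_denoiser(x_t, keypoints_2d, t)
            # posterior mean of x_{t-1} given the predicted noise
            x_t = (x_t - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps) / np.sqrt(alphas[t])
            if t > 0:  # inject fresh noise except at the final step
                x_t += np.sqrt(betas[t]) * rng.standard_normal((NUM_JOINTS, 3))
        poses.append(x_t)
    return np.stack(poses)  # shape: (NUM_HYPOTHESES, NUM_JOINTS, 3)

kps = rng.standard_normal((NUM_JOINTS, 2))  # stand-in for detector output
hyps = sample_hypotheses(kps)
print(hyps.shape)  # (5, 17, 3)
```

Because each hypothesis starts from an independent noise sample and fresh noise is injected at every step, the candidates differ from one another while remaining consistent with the same conditioning 2D keypoints, which is what lets the framework represent the depth ambiguity as a distribution rather than a single point estimate.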