Following the successful application of deep convolutional neural networks to 2d human pose estimation, the next logical problem is 3d human pose estimation from monocular images. While previous solutions have shown some success, they do not fully exploit the depth information implicit in the 2d inputs. To address this depth ambiguity, we build a system that takes 2d joint locations, together with their estimated depth values, as input and predicts their 3d positions in camera coordinates. Given the inherent noise and inaccuracy of depth maps estimated from monocular images, we perform an extensive statistical analysis showing that, despite this noise, there is still a statistically significant correlation between the predicted depth values and the third camera coordinate. We further explain how the state-of-the-art results we achieve on the H3.6M validation set are due to this additional depth input. Notably, our results are produced by a neural network that accepts a low-dimensional input and can be integrated into a real-time system. Furthermore, our system can be combined with an off-the-shelf 2d pose detector and a depth map predictor to perform 3d pose estimation in the wild.
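The input/output interface described above can be sketched as follows. This is a minimal illustration, not the paper's architecture: the joint count of 17 (an H3.6M-style skeleton), the function names, and the single linear layer standing in for the network are all assumptions for demonstration.

```python
import numpy as np

NUM_JOINTS = 17  # assumed joint count for an H3.6M-style skeleton


def build_input(joints_2d, depths):
    """Concatenate 2d joint locations with their estimated per-joint depth
    values into one low-dimensional vector of shape (NUM_JOINTS * 3,)."""
    assert joints_2d.shape == (NUM_JOINTS, 2)
    assert depths.shape == (NUM_JOINTS,)
    return np.concatenate([joints_2d, depths[:, None]], axis=1).ravel()


def predict_3d(x, weights, bias):
    """Placeholder linear lifter: maps the (NUM_JOINTS * 3,) input to 3d
    joint positions in camera coordinates, shape (NUM_JOINTS, 3). A trained
    network would replace this single matrix multiply."""
    return (weights @ x + bias).reshape(NUM_JOINTS, 3)


# Toy usage with random inputs and untrained weights (illustration only).
rng = np.random.default_rng(0)
x = build_input(rng.random((NUM_JOINTS, 2)), rng.random(NUM_JOINTS))
W = rng.standard_normal((NUM_JOINTS * 3, NUM_JOINTS * 3))
b = np.zeros(NUM_JOINTS * 3)
pose_3d = predict_3d(x, W, b)
print(pose_3d.shape)  # (17, 3)
```

Because the input is this small flat vector rather than an image, the lifting network stays lightweight, which is what makes real-time integration plausible.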