Due to the lack of camera parameter information for in-the-wild images, existing 3D human pose and shape (HPS) estimation methods make several simplifying assumptions: weak-perspective projection, large constant focal length, and zero camera rotation. These assumptions often do not hold and we show, quantitatively and qualitatively, that they cause errors in the reconstructed 3D shape and pose. To address this, we introduce SPEC, the first in-the-wild 3D HPS method that estimates the perspective camera from a single image and uses it to reconstruct 3D human bodies more accurately. First, we train a neural network to estimate the field of view, camera pitch, and roll given an input image. We employ novel losses that improve the calibration accuracy over previous work. We then train a novel network that concatenates the camera calibration to the image features and uses these together to regress 3D body shape and pose. SPEC is more accurate than the prior art on the standard benchmark (3DPW) as well as on two new datasets with more challenging camera views and varying focal lengths. Specifically, we create a new photorealistic synthetic dataset (SPEC-SYN) with ground truth 3D bodies and a novel in-the-wild dataset (SPEC-MTP) with calibration and high-quality reference bodies. Both qualitative and quantitative analyses confirm that using camera parameters during inference yields more accurate body reconstructions. Code and datasets are available for research purposes at https://spec.is.tue.mpg.de.
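To make the camera quantities named above concrete, the following is a minimal sketch of how a field of view, pitch, and roll could be turned into a perspective projection. It is not the SPEC implementation: the function names, the axis conventions (pitch about the x-axis, roll about the z-axis, roll applied after pitch), and the choice of the image center as principal point are all assumptions made for illustration. Only the pinhole relation between vertical field of view and focal length is standard.

```python
import math


def focal_length_from_vfov(vfov_rad, image_height_px):
    """Standard pinhole relation: f = (H / 2) / tan(vfov / 2), in pixels."""
    return 0.5 * image_height_px / math.tan(0.5 * vfov_rad)


def rotation_from_pitch_roll(pitch_rad, roll_rad):
    """Camera rotation as R_roll(z) @ R_pitch(x), multiplied out by hand.

    The axis assignment and composition order are assumptions of this
    sketch, not a convention taken from the paper.
    """
    cp, sp = math.cos(pitch_rad), math.sin(pitch_rad)
    cr, sr = math.cos(roll_rad), math.sin(roll_rad)
    return [
        [cr, -sr * cp, sr * sp],
        [sr, cr * cp, -cr * sp],
        [0.0, sp, cp],
    ]


def perspective_project(point, focal_px, R, t, image_size):
    """Project one 3D point to pixel coordinates with a full perspective camera.

    Assumes the principal point is the image center (hypothetical choice).
    """
    width, height = image_size
    # Transform the point into the camera frame: X_cam = R @ X + t
    xc, yc, zc = (
        sum(R[i][j] * point[j] for j in range(3)) + t[i] for i in range(3)
    )
    # Perspective division by depth, then shift to the image center
    u = focal_px * xc / zc + width / 2
    v = focal_px * yc / zc + height / 2
    return (u, v)
```

For example, a 90 degree vertical field of view on a 1000-pixel-high image gives a focal length of about 500 pixels, which is far smaller than the large constant focal lengths the abstract criticizes. In contrast to weak-perspective projection, the division by the per-point depth `zc` lets foreshortening vary across the body.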