This work proposes a new generation-based 3D reconstruction method, named Cupid, that accurately infers the camera pose, 3D shape, and texture of an object from a single 2D image. Cupid casts 3D reconstruction as a conditional sampling process from a learned distribution of 3D objects, and it jointly generates voxels and pixel-voxel correspondences, enabling robust pose and shape estimation under a unified generative framework. By representing both the input camera pose and the object's 3D shape as distributions in a shared 3D latent space, Cupid adopts a two-stage flow-matching pipeline: (1) a coarse stage that produces initial 3D geometry with associated 2D projections for pose recovery; and (2) a refinement stage that integrates pose-aligned image features to enhance structural fidelity and appearance details. Extensive experiments demonstrate that Cupid outperforms leading 3D reconstruction methods, with a PSNR gain of over 3 dB and a Chamfer Distance reduction of over 10%, while matching monocular estimators in pose accuracy and delivering superior visual fidelity over baseline 3D generative models. For an immersive view of the 3D results generated by Cupid, please visit cupid3d.github.io.
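To make the two-stage pipeline concrete, the sketch below shows a minimal conditional flow-matching sampler in PyTorch. It is an illustrative assumption, not Cupid's released code: the module and helper names (coarse_flow, refine_flow, recover_pose, project_features) and the latent shape are hypothetical placeholders, and the sampler uses plain Euler integration of a learned velocity field.

```python
# Hedged sketch of a two-stage conditional flow-matching sampler.
# All names below (coarse_flow, refine_flow, recover_pose, project_features)
# are hypothetical placeholders for illustration, not Cupid's actual API.
import torch

def sample_flow(velocity_net, x, cond, steps=50):
    """Euler integration of a learned velocity field v_theta(x, t | cond) from t=0 to t=1."""
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((x.shape[0],), i * dt, device=x.device)
        x = x + dt * velocity_net(x, t, cond)  # dx/dt = v_theta(x, t | condition)
    return x

def reconstruct(image_feats, coarse_flow, refine_flow, recover_pose, project_features,
                latent_shape=(1, 8, 16, 16, 16), steps=50):
    # Stage 1: sample coarse voxel latents (with pixel-voxel correspondences)
    # conditioned on image features, then recover the camera pose from the
    # generated 2D projections (e.g., via a PnP-style solver).
    z = torch.randn(latent_shape)
    coarse = sample_flow(coarse_flow, z, image_feats, steps)
    pose = recover_pose(coarse, image_feats)

    # Stage 2: refine the coarse latents conditioned on pose-aligned image
    # features to improve structural fidelity and appearance details.
    aligned = project_features(image_feats, pose)
    refined = sample_flow(refine_flow, coarse, aligned, steps)
    return refined, pose
```

The two calls to sample_flow mirror the coarse and refinement stages described in the abstract; in practice the conditioning, pose recovery, and latent decoding would follow the paper's actual architecture.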