DITTO-NeRF: Diffusion-based Iterative Text To Omni-directional 3D Model

The increasing demand for high-quality 3D content creation has motivated the development of automated methods for creating 3D object models from a single image and/or a text prompt. However, 3D objects reconstructed with state-of-the-art image-to-3D methods still exhibit low correspondence to the given image and low multi-view consistency. Recent state-of-the-art text-to-3D methods are also limited, yielding 3D samples with low diversity per prompt and long synthesis times. To address these challenges, we propose DITTO-NeRF, a novel pipeline for generating a high-quality 3D NeRF model from a text prompt or a single image. DITTO-NeRF first constructs a high-quality partial 3D object for limited in-boundary (IB) angles, using the given or text-generated 2D image from the frontal view, and then iteratively reconstructs the remaining 3D NeRF using an inpainting latent diffusion model. We propose progressive 3D object reconstruction schemes in terms of scales (low to high resolution), angles (IB angles initially, outer-boundary (OB) angles later), and masks (object to background boundary) so that high-quality information in the IB region can be propagated to the OB region. DITTO-NeRF outperforms state-of-the-art methods in terms of fidelity and diversity, both qualitatively and quantitatively, with much faster training times than prior image/text-to-3D methods such as DreamFusion and NeuralLift-360.
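
The progressive schemes in scale and angle described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the function name `progressive_schedule`, the `Stage` type, the 30-degree IB half-angle, the resolution list, and the linear widening rule are hypothetical and are not taken from the paper. The mask progression (object to background boundary) is not modeled here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    resolution: int          # render resolution in pixels (hypothetical values)
    max_azimuth_deg: float   # half-width of the azimuth range around the frontal view


def progressive_schedule(step, total_steps, ib_half_angle_deg=30.0,
                         resolutions=(64, 128, 256)):
    """Return the training stage for a step (illustrative, not the paper's schedule)."""
    if not 0 <= step < total_steps:
        raise ValueError("step must be in [0, total_steps)")
    progress = step / total_steps
    # Scale: walk through the resolution list from low to high.
    res_index = min(int(progress * len(resolutions)), len(resolutions) - 1)
    # Angle: start at the assumed IB range and widen linearly toward 180 degrees,
    # so that outer-boundary (OB) views are only sampled later in training.
    max_azimuth = ib_half_angle_deg + progress * (180.0 - ib_half_angle_deg)
    return Stage(resolutions[res_index], max_azimuth)
```

Under these assumptions, early steps sample only low-resolution views near the frontal image, and later steps reach higher resolutions and wider angles. A real schedule could just as well be stepwise or coupled to the inpainting iterations.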
