Recent CLIP-guided 3D optimization methods, such as DreamFields and PureCLIPNeRF, have achieved impressive results in zero-shot text-to-3D synthesis. However, because they train from scratch with random initialization and no prior knowledge of 3D shape, these methods often fail to generate accurate and faithful 3D structures that conform to the input text. In this paper, we make the first attempt to introduce explicit 3D shape priors into the CLIP-guided 3D optimization process. Specifically, we first generate a high-quality 3D shape from the input text as a 3D shape prior in the text-to-shape stage. We then use it to initialize a neural radiance field, which we optimize with the full text prompt. To address the challenging text-to-shape generation task, we present a simple yet effective approach that directly bridges the text and image modalities with a powerful text-to-image diffusion model. To narrow the style domain gap between the images synthesized by the text-to-image diffusion model and the shape renderings used to train the image-to-shape generator, we further propose to jointly optimize a learnable text prompt and fine-tune the text-to-image diffusion model for rendering-style image generation. Our method, Dream3D, is capable of generating imaginative 3D content with superior visual quality and shape accuracy compared to state-of-the-art methods.
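To make the two-stage structure concrete, below is a minimal, runnable sketch of the idea in PyTorch, not the authors' released implementation. Everything in it is a stand-in assumption: `TinyNeRF` is a toy radiance field, `shape_prior_occupancy` (a unit sphere here) stands in for the output of the text-to-shape stage, and `dummy_prompt_loss` stands in for the CLIP image-text similarity computed on rendered views of the full prompt.

```python
# Sketch of the two-stage pipeline: (1) distill an explicit 3D shape prior
# into a radiance field, (2) refine the prior-initialized field under text
# guidance. All components here are toy placeholders, not the Dream3D code.

import torch
import torch.nn as nn

class TinyNeRF(nn.Module):
    """Toy radiance field: an MLP mapping 3D points to (density, RGB)."""
    def __init__(self, hidden=64):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(3, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # 1 density channel + 3 color channels
        )

    def forward(self, xyz):
        out = self.mlp(xyz)
        return torch.relu(out[..., :1]), torch.sigmoid(out[..., 1:])

def shape_prior_occupancy(xyz):
    # Placeholder for the text-to-shape stage: a unit-radius-0.5 sphere.
    return (xyz.norm(dim=-1, keepdim=True) < 0.5).float()

def distill_prior(nerf, steps=200, lr=1e-3):
    """Stage 1 hand-off: fit the NeRF density to the prior occupancy, so
    stage-2 optimization starts from an explicit 3D structure, not scratch."""
    opt = torch.optim.Adam(nerf.parameters(), lr=lr)
    for _ in range(steps):
        xyz = torch.rand(4096, 3) * 2 - 1  # random points in [-1, 1]^3
        density, _ = nerf(xyz)
        # Squash density to [0, 1) so it is comparable to binary occupancy.
        loss = nn.functional.mse_loss(torch.tanh(density),
                                      shape_prior_occupancy(xyz))
        opt.zero_grad(); loss.backward(); opt.step()

def dummy_prompt_loss(density, rgb):
    # Placeholder for the CLIP loss between rendered views and the prompt.
    return -(density.mean() + rgb.mean())

def optimize_with_prompt(nerf, steps=100, lr=1e-4):
    """Stage 2: refine the prior-initialized field under text guidance."""
    opt = torch.optim.Adam(nerf.parameters(), lr=lr)
    for _ in range(steps):
        xyz = torch.rand(4096, 3) * 2 - 1
        density, rgb = nerf(xyz)
        loss = dummy_prompt_loss(density, rgb)
        opt.zero_grad(); loss.backward(); opt.step()

nerf = TinyNeRF()
distill_prior(nerf)         # initialize from the 3D shape prior
optimize_with_prompt(nerf)  # then optimize with the full text prompt
```

The key design point the sketch illustrates is the hand-off: the shape prior constrains only the density (geometry) before prompt-driven optimization begins, which is why the resulting 3D structure stays faithful to the input text rather than drifting as scratch-trained fields can.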