Recent CLIP-guided 3D optimization methods, e.g., DreamFields and PureCLIPNeRF, have achieved great success in zero-shot text-guided 3D synthesis. However, because they are trained from scratch with random initialization and no prior knowledge, these methods usually fail to generate accurate and faithful 3D structures that conform to the input text. In this paper, we make the first attempt to introduce an explicit 3D shape prior into CLIP-guided 3D optimization methods. Specifically, we first generate a high-quality 3D shape from the input text in a text-to-shape stage and use it as the 3D shape prior. We then use it to initialize a neural radiance field and optimize it with the full prompt. For text-to-shape generation, we present a simple yet effective approach that directly bridges the text and image modalities with a powerful text-to-image diffusion model. To narrow the style domain gap between images synthesized by the text-to-image model and the shape renderings used to train the image-to-shape generator, we further propose to jointly optimize a learnable text prompt and fine-tune the text-to-image diffusion model for rendering-style image generation. Our method, Dream3D, is capable of generating imaginative 3D content with better visual quality and shape accuracy than state-of-the-art methods.
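To make the two-stage pipeline concrete, the following is a minimal, hypothetical PyTorch sketch: stage one fits a toy radiance field to an explicit 3D shape prior (e.g., points sampled from the text-to-shape result), and stage two refines the initialized field with a CLIP image-text score on rendered views. The names used here (NeRF, init_from_shape_prior, clip_guided_optimization, render_fn, clip_score_fn) are illustrative assumptions, not the authors' released implementation.

```python
# Sketch of the two-stage pipeline described in the abstract.
# All class/function names are hypothetical placeholders.
import torch

class NeRF(torch.nn.Module):
    """Toy radiance field: maps 3D points to (density, rgb)."""
    def __init__(self):
        super().__init__()
        self.mlp = torch.nn.Sequential(
            torch.nn.Linear(3, 64), torch.nn.ReLU(), torch.nn.Linear(64, 4))

    def forward(self, xyz):
        out = self.mlp(xyz)
        return out[..., :1], torch.sigmoid(out[..., 1:])  # density, rgb

def init_from_shape_prior(nerf, shape_points, steps=100):
    """Stage 1 (sketch): fit the radiance field's density to an explicit
    3D shape prior obtained from text-to-shape generation."""
    opt = torch.optim.Adam(nerf.parameters(), lr=1e-3)
    for _ in range(steps):
        density, _ = nerf(shape_points)
        # Encourage occupancy at points belonging to the shape prior.
        loss = (1.0 - torch.sigmoid(density)).mean()
        opt.zero_grad(); loss.backward(); opt.step()

def clip_guided_optimization(nerf, render_fn, clip_score_fn, prompt, steps=100):
    """Stage 2 (sketch): optimize the initialized field so that rendered
    views agree with the full text prompt under a CLIP similarity score."""
    opt = torch.optim.Adam(nerf.parameters(), lr=1e-4)
    for _ in range(steps):
        image = render_fn(nerf)                # render a random view
        loss = -clip_score_fn(image, prompt)   # maximize image-text agreement
        opt.zero_grad(); loss.backward(); opt.step()
```

In this sketch, `render_fn` and `clip_score_fn` stand in for a differentiable volume renderer and a frozen CLIP similarity function, respectively; the key point is only the ordering, i.e., the shape prior initializes the field before CLIP guidance begins.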