We tackle the task of text-to-3D creation with pre-trained latent-based NeRFs (NeRFs that generate 3D objects given an input latent code). Recent works such as DreamFusion and Magic3D have shown great success in generating 3D content from text prompts with NeRFs, but the current approach of optimizing a NeRF for every text prompt is 1) extremely time-consuming and 2) often yields low-resolution outputs. To address these challenges, we propose a novel method, 3D-CLFusion, which leverages pre-trained latent-based NeRFs to perform fast 3D content creation in less than a minute. In particular, we introduce a latent diffusion prior network that learns the w latent from input CLIP text/image embeddings. This pipeline allows us to produce the w latent without further optimization at inference time, after which the pre-trained NeRF performs multi-view, high-resolution 3D synthesis conditioned on that latent. The novelty of our model lies in introducing contrastive learning while training the diffusion prior, which enables the generation of valid, view-invariant latent codes. We demonstrate through experiments the effectiveness of the proposed view-invariant diffusion process for fast text-to-3D creation, e.g., 100 times faster than DreamFusion, and show that our model can serve as a plug-and-play tool for text-to-3D with pre-trained NeRFs.
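To make the described pipeline concrete, below is a minimal, hypothetical sketch of training a diffusion prior with a contrastive view-invariance term. Everything here is an illustrative assumption rather than the paper's actual implementation: the `PriorNet` architecture, the `add_noise` schedule, the `info_nce` loss, the 512-dimensional latent/CLIP sizes, and the loss weight `lam` are all placeholders chosen only to show the structure (denoising loss on the w latent, plus a contrastive term that pulls together latents predicted from CLIP embeddings of two views of the same object).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Illustrative dimensions; the real model's sizes are not specified here.
W_DIM, CLIP_DIM, T_STEPS = 512, 512, 1000

class PriorNet(nn.Module):
    """Toy denoiser: predicts the clean w latent from (noised w, timestep, CLIP embedding)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(W_DIM + CLIP_DIM + 1, 1024), nn.SiLU(),
            nn.Linear(1024, W_DIM),
        )

    def forward(self, w_t, t, c):
        t_feat = (t.float() / T_STEPS).unsqueeze(-1)  # scalar timestep feature
        return self.net(torch.cat([w_t, t_feat, c], dim=-1))

def add_noise(w, noise, t):
    """Simple interpolating noise schedule for q(w_t | w_0); a stand-in, not the paper's."""
    a = (1.0 - t.float() / T_STEPS).unsqueeze(-1)     # signal rate
    return a * w + (1.0 - a**2).clamp(min=0).sqrt() * noise

def info_nce(z_a, z_b, temperature=0.07):
    """Symmetric InfoNCE: latents from two views of the same object are positives;
    all other pairs in the batch act as negatives."""
    z_a, z_b = F.normalize(z_a, dim=-1), F.normalize(z_b, dim=-1)
    logits = z_a @ z_b.t() / temperature              # (B, B) similarity matrix
    targets = torch.arange(z_a.size(0), device=z_a.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))

def training_step(prior, w, clip_a, clip_b, lam=0.1):
    """One step: diffusion reconstruction loss plus a contrastive view-invariance term."""
    t = torch.randint(0, T_STEPS, (w.size(0),))
    noise = torch.randn_like(w)
    w_t = add_noise(w, noise, t)
    w_pred_a = prior(w_t, t, clip_a)   # conditioned on CLIP embedding of view A
    w_pred_b = prior(w_t, t, clip_b)   # conditioned on CLIP embedding of view B
    loss_diff = F.mse_loss(w_pred_a, w) + F.mse_loss(w_pred_b, w)
    loss_cl = info_nce(w_pred_a, w_pred_b)  # pulls the two views' latents together
    return loss_diff + lam * loss_cl

# Usage with random stand-in data:
prior = PriorNet()
w = torch.randn(8, W_DIM)                      # target latents for the pre-trained NeRF
clip_a, clip_b = torch.randn(8, CLIP_DIM), torch.randn(8, CLIP_DIM)
loss = training_step(prior, w, clip_a, clip_b)
loss.backward()
```

At inference, one would run the learned prior's reverse diffusion from random noise, conditioned on the CLIP text embedding, to sample a w latent directly, which is what removes the per-prompt optimization loop.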