3D-aware image synthesis encompasses a variety of tasks, such as scene generation and novel view synthesis from images. Despite numerous task-specific methods, developing a comprehensive model remains challenging. In this paper, we present SSDNeRF, a unified approach that employs an expressive diffusion model to learn a generalizable prior of neural radiance fields (NeRF) from multi-view images of diverse objects. Previous studies have used two-stage approaches that rely on pretrained NeRFs as real data to train diffusion models. In contrast, we propose a new single-stage training paradigm with an end-to-end objective that jointly optimizes a NeRF auto-decoder and a latent diffusion model, enabling simultaneous 3D reconstruction and prior learning, even from sparsely available views. At test time, we can directly sample the diffusion prior for unconditional generation, or combine it with arbitrary observations of unseen objects for NeRF reconstruction. SSDNeRF demonstrates robust results comparable to or better than leading task-specific methods in unconditional generation and single/sparse-view 3D reconstruction.
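The single-stage objective described above can be sketched as a formula. This is one possible reading, not the paper's exact formulation: the abstract states only that a NeRF auto-decoder and a latent diffusion model are optimized jointly under one end-to-end objective. The symbols below are all assumptions introduced for illustration: $x_i$ is the latent scene code of object $i$, $\psi$ the shared NeRF decoder parameters, $\phi$ the diffusion model parameters, $y_i(r)$ the observed color along ray $r$, and $\lambda_{\mathrm{rend}}$, $\lambda_{\mathrm{diff}}$ the loss weights. The diffusion term is assumed to be a standard noise-prediction denoising loss.

```latex
% Hypothetical notation; a sketch of a joint objective, not the paper's stated loss.
\begin{aligned}
\min_{\{x_i\},\,\psi,\,\phi}\quad
  & \lambda_{\mathrm{rend}} \sum_i \mathcal{L}_{\mathrm{rend}}(x_i, \psi)
  \;+\; \lambda_{\mathrm{diff}} \sum_i \mathcal{L}_{\mathrm{diff}}(x_i, \phi), \\[4pt]
\mathcal{L}_{\mathrm{rend}}(x_i, \psi)
  &= \mathbb{E}_{r}\!\left[ \bigl\| \hat{y}_{\psi}(x_i, r) - y_i(r) \bigr\|_2^2 \right], \\[4pt]
\mathcal{L}_{\mathrm{diff}}(x_i, \phi)
  &= \mathbb{E}_{t,\,\epsilon \sim \mathcal{N}(0, I)}\!\left[ w(t)\,
     \bigl\| \hat{\epsilon}_{\phi}(\alpha_t x_i + \sigma_t \epsilon,\; t) - \epsilon \bigr\|_2^2 \right].
\end{aligned}
```

Under this reading, a two-stage approach would first minimize the rendering term alone to obtain the codes $x_i$, then fit $\phi$ to those frozen codes. In the single-stage version the diffusion term also acts on the codes while they are being reconstructed, which is consistent with the claim that the prior can be learned even from sparsely available views. At test time, the same two terms would cover both use cases: sampling from the learned prior alone gives unconditional generation, while minimizing the rendering term on the observed views of an unseen object, guided by the prior, gives reconstruction.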