We present a diffusion-based model for 3D-aware generative novel view synthesis from as few as a single input image. Our model samples from the distribution of possible renderings consistent with the input and, even in the presence of ambiguity, is capable of rendering diverse and plausible novel views. To achieve this, our method makes use of existing 2D diffusion backbones but, crucially, incorporates geometry priors in the form of a 3D feature volume. This latent feature field captures the distribution over possible scene representations and improves our method's ability to generate view-consistent novel renderings. In addition to generating novel views, our method has the ability to autoregressively synthesize 3D-consistent sequences. We demonstrate state-of-the-art results on synthetic renderings and room-scale scenes; we also show compelling results for challenging, real-world objects.
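The abstract outlines a pipeline in which an input image is lifted to a 3D feature volume, that latent feature field is volume-rendered for a target camera, and the rendering conditions a 2D diffusion backbone. The sketch below is a minimal, illustrative reading of that pipeline, not the authors' implementation: all module names, channel sizes, and the simplified axis-aligned ray marching (in place of rays from a real target camera pose) are assumptions made only to show how the pieces could fit together.

```python
# Minimal sketch (not the paper's code) of conditioning a 2D denoiser on a
# volume-rendered 3D feature field. Shapes and modules are illustrative.
import torch
import torch.nn as nn


class LatentFeatureVolume(nn.Module):
    """Lift a single RGB image to a coarse 3D feature volume (assumed design)."""

    def __init__(self, feat_dim=16, depth=32):
        super().__init__()
        self.feat_dim = feat_dim
        self.depth = depth
        self.encoder = nn.Conv2d(3, feat_dim * depth, kernel_size=3, padding=1)

    def forward(self, image):                        # image: (B, 3, H, W)
        b, _, h, w = image.shape
        feats = self.encoder(image)                  # (B, C*D, H, W)
        return feats.view(b, self.feat_dim, self.depth, h, w)  # (B, C, D, H, W)


def render_feature_image(volume):
    """Alpha-composite the feature volume along axis-aligned rays.

    Stand-in for rendering the latent feature field; a faithful version would
    march rays defined by the target camera pose.
    """
    # Treat channel 0 as density, remaining channels as features.
    density = torch.sigmoid(volume[:, :1])           # (B, 1, D, H, W)
    feats = volume[:, 1:]                            # (B, C-1, D, H, W)
    # Front-to-back transmittance along the depth axis.
    trans = torch.cumprod(
        torch.cat([torch.ones_like(density[:, :, :1]),
                   1.0 - density[:, :, :-1]], dim=2), dim=2)
    weights = density * trans                        # (B, 1, D, H, W)
    return (weights * feats).sum(dim=2)              # (B, C-1, H, W)


class ConditionalDenoiser(nn.Module):
    """Tiny conv net standing in for the 2D diffusion backbone: predicts noise
    from the noisy target view concatenated with the rendered feature image."""

    def __init__(self, feat_dim=16):
        super().__init__()
        in_ch = 3 + (feat_dim - 1)                   # noisy RGB + rendered features
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, 3, 3, padding=1),
        )

    def forward(self, noisy_target, rendered_feats):
        return self.net(torch.cat([noisy_target, rendered_feats], dim=1))


if __name__ == "__main__":
    lift = LatentFeatureVolume()
    denoiser = ConditionalDenoiser()
    src = torch.rand(1, 3, 64, 64)                   # single input view
    noisy = torch.randn(1, 3, 64, 64)                # noised target view
    cond = render_feature_image(lift(src))
    print(denoiser(noisy, cond).shape)               # torch.Size([1, 3, 64, 64])
```

For autoregressive sequence synthesis as described above, each generated view would be fed back as an additional input when building the feature volume for the next target camera; that loop is omitted here for brevity.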