In this work, we present SceneDreamer, an unconditional generative model for unbounded 3D scenes, which synthesizes large-scale 3D landscapes from random noise. Our framework is learned solely from in-the-wild 2D image collections, without any 3D annotations. At the core of SceneDreamer is a principled learning paradigm comprising 1) an efficient yet expressive 3D scene representation, 2) a generative scene parameterization, and 3) an effective renderer that can leverage the knowledge from 2D images. Our approach begins with an efficient bird's-eye-view (BEV) representation generated from simplex noise, which consists of a height field for surface elevation and a semantic field for detailed scene semantics. This BEV scene representation enables 1) representing a 3D scene with quadratic complexity, 2) disentangled geometry and semantics, and 3) efficient training. Moreover, we propose a novel generative neural hash grid that parameterizes the latent space based on 3D positions and scene semantics, aiming to encode generalizable features across various scenes. Lastly, a neural volumetric renderer, learned from 2D image collections through adversarial training, is employed to produce photorealistic images. Extensive experiments demonstrate the effectiveness of SceneDreamer and its superiority over state-of-the-art methods in generating vivid and diverse unbounded 3D worlds.
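To make the BEV scene representation concrete, the sketch below derives a height field from 2D noise and thresholds it into a semantic field. This is a minimal illustration, not the paper's implementation: value noise stands in for simplex noise, and the elevation-based semantic labels (water/sand/grass/rock) are hypothetical choices for demonstration.

```python
import numpy as np

def value_noise_2d(shape, scale, seed=0):
    """Smooth 2D noise via bilinear interpolation of a coarse random grid.
    (A simple stand-in for the simplex noise used in the paper.)"""
    rng = np.random.default_rng(seed)
    h, w = shape
    coarse = rng.random((h // scale + 2, w // scale + 2))
    ys, xs = np.mgrid[0:h, 0:w] / scale
    y0, x0 = ys.astype(int), xs.astype(int)
    fy, fx = ys - y0, xs - x0
    # Smoothstep weights for C1-continuous interpolation
    fy, fx = fy * fy * (3 - 2 * fy), fx * fx * (3 - 2 * fx)
    top = coarse[y0, x0] * (1 - fx) + coarse[y0, x0 + 1] * fx
    bot = coarse[y0 + 1, x0] * (1 - fx) + coarse[y0 + 1, x0 + 1] * fx
    return top * (1 - fy) + bot * fy

def bev_scene(size=256, seed=0):
    """Toy BEV representation: a height field plus a semantic field.
    Semantics here are illustrative elevation bands:
    0 = water, 1 = sand, 2 = grass, 3 = rock."""
    height = value_noise_2d((size, size), scale=64, seed=seed)
    semantics = np.digitize(height, bins=[0.3, 0.5, 0.7])
    return height, semantics
```

Because both fields are 2D maps over the ground plane, the memory cost grows quadratically with scene extent rather than cubically as for a dense 3D voxel grid, which is the complexity advantage the abstract refers to.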