In this work, we present SceneDreamer, an unconditional generative model for unbounded 3D scenes, which synthesizes large-scale 3D landscapes from random noise. Our framework is learned from in-the-wild 2D image collections only, without any 3D annotations. At the core of SceneDreamer is a principled learning paradigm comprising 1) an efficient yet expressive 3D scene representation, 2) a generative scene parameterization, and 3) an effective renderer that can leverage the knowledge from 2D images. Our framework starts from an efficient bird's-eye-view (BEV) representation generated from simplex noise, which consists of a height field and a semantic field. The height field represents the surface elevation of 3D scenes, while the semantic field provides detailed scene semantics. This BEV scene representation enables 1) representing a 3D scene with quadratic complexity, 2) disentangled geometry and semantics, and 3) efficient training. Furthermore, we propose a novel generative neural hash grid to parameterize the latent space given 3D positions and the scene semantics, which aims to encode generalizable features across scenes. Lastly, a neural volumetric renderer, learned from 2D image collections through adversarial training, is employed to produce photorealistic images. Extensive experiments demonstrate the effectiveness of SceneDreamer and its superiority over state-of-the-art methods in generating vivid yet diverse unbounded 3D worlds.
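To make the generative scene parameterization concrete, below is a minimal sketch (not the authors' released code) of the generative neural hash grid idea: an Instant-NGP-style multiresolution hash encoding whose spatial hash is additionally indexed by a per-scene latent, so one shared feature table can encode features that generalize across scenes. All class and function names (`GenerativeHashGrid`, `spatial_hash`, `scene_proj`) and hyperparameters below are illustrative assumptions, not the paper's exact design.

```python
# Sketch of a generative neural hash grid: position + scene latent -> feature.
import torch
import torch.nn as nn

PRIMES = (1, 2654435761, 805459861, 3674653429)  # common spatial-hash primes

class GenerativeHashGrid(nn.Module):
    def __init__(self, n_levels=8, table_size=2**16, feat_dim=2,
                 base_res=16, scene_dim=64):
        super().__init__()
        self.table_size = table_size
        self.resolutions = [base_res * 2 ** l for l in range(n_levels)]
        # One learnable feature table per resolution level, shared by all scenes.
        self.tables = nn.Parameter(torch.randn(n_levels, table_size, feat_dim) * 1e-2)
        # Hypothetical choice: quantize the scene latent to an integer that is
        # folded into the hash, entangling position and scene identity.
        self.scene_proj = nn.Linear(scene_dim, 1)

    def spatial_hash(self, corners, scene_id):
        # corners: (N, 3) integer grid coordinates; scene_id: (N,) integers.
        h = (corners[:, 0] * PRIMES[1]) ^ (corners[:, 1] * PRIMES[2]) \
            ^ (corners[:, 2] * PRIMES[3]) ^ (scene_id * PRIMES[0])
        return h % self.table_size

    def forward(self, x, scene_latent):
        # x: (N, 3) positions normalized to [0, 1]; scene_latent: (scene_dim,).
        scene_id = self.scene_proj(scene_latent).abs().long()  # shape (1,)
        scene_id = scene_id.expand(x.shape[0])                 # shape (N,)
        feats = []
        for lvl, res in enumerate(self.resolutions):
            # Nearest-corner lookup; a full version would trilinearly
            # interpolate the 8 surrounding corners.
            corners = (x * res).floor().long()
            idx = self.spatial_hash(corners, scene_id)
            feats.append(self.tables[lvl][idx])   # (N, feat_dim) per level
        return torch.cat(feats, dim=-1)           # (N, n_levels * feat_dim)
```

In a full pipeline along the lines the abstract describes, this feature would be queried at each volume-rendering sample point, combined with the BEV semantics at that location, and decoded to color and density by the adversarially trained volumetric renderer.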