Automatically generating high-quality real-world 3D scenes is of enormous interest for applications such as virtual reality and robotics simulation. Towards this goal, we introduce NeuralField-LDM, a generative model capable of synthesizing complex 3D environments. We leverage Latent Diffusion Models that have been successfully utilized for efficient, high-quality 2D content creation. We first train a scene auto-encoder that expresses a set of image and pose pairs as a neural field, represented as density and feature voxel grids that can be projected to produce novel views of the scene. To further compress this representation, we train a latent auto-encoder that maps the voxel grids to a set of latent representations. A hierarchical diffusion model is then fit to the latents to complete the scene generation pipeline. We achieve a substantial improvement over existing state-of-the-art scene generation models. Additionally, we show how NeuralField-LDM can be used for a variety of 3D content creation applications, including conditional scene generation, scene inpainting, and scene style manipulation.
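To make the data flow through the three stages concrete, the sketch below traces the tensor shapes of the pipeline described above: posed images are lifted into density and feature voxel grids, which a latent auto-encoder then compresses into a hierarchy of latents on which a diffusion model would be trained. This is a minimal shape-level illustration, not the paper's implementation: the grid resolutions, channel counts, and function names are hypothetical, the learned networks are replaced by placeholder operations (random grids, average pooling), and the diffusion stage itself is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes; the paper's actual grid and latent dimensions differ.
N_VIEWS, H, W = 4, 64, 64
GRID = (32, 32, 8)   # voxel grid resolution (x, y, z)
FEAT = 16            # feature channels per voxel

def encode_scene(images, poses):
    """Scene auto-encoder stand-in: lift a set of posed images into a
    density voxel grid and a feature voxel grid (placeholder: random)."""
    density = rng.random(GRID)             # one density value per voxel
    features = rng.random(GRID + (FEAT,))  # one feature vector per voxel
    return density, features

def compress_to_latents(density, features):
    """Latent auto-encoder stand-in: map the voxel grids to a hierarchy
    of latents (placeholder: average pooling plus a global mean vector).
    A hierarchical diffusion model would then be fit to these latents."""
    # Downsample each spatial axis by block-averaging: (32,32,8) -> (8,8,2).
    coarse = features.reshape(8, 4, 8, 4, 2, 4, FEAT).mean(axis=(1, 3, 5))
    global_latent = features.mean(axis=(0, 1, 2))  # one vector per scene
    return {"global": global_latent, "coarse": coarse, "fine": features}

images = rng.random((N_VIEWS, H, W, 3))          # a set of input views
poses = np.stack([np.eye(4)] * N_VIEWS)          # camera-to-world matrices

density, features = encode_scene(images, poses)
latents = compress_to_latents(density, features)
print(latents["global"].shape, latents["coarse"].shape)
```

The hierarchy (global vector, coarse grid, fine grid) mirrors the abstract's "set of latent representations" that the hierarchical diffusion model is fit to; in the real system each stage is a trained network rather than pooling.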