Diffusion models currently achieve state-of-the-art performance for both conditional and unconditional image generation. However, so far, image diffusion models do not support tasks required for 3D understanding, such as view-consistent 3D generation or single-view object reconstruction. In this paper, we present RenderDiffusion, the first diffusion model for 3D generation and inference, trained using only monocular 2D supervision. Central to our method is a novel image denoising architecture that generates and renders an intermediate three-dimensional representation of a scene in each denoising step. This enforces a strong inductive structure within the diffusion process, providing a 3D-consistent representation while only requiring 2D supervision. The resulting 3D representation can be rendered from any view. We evaluate RenderDiffusion on the FFHQ, AFHQ, ShapeNet, and CLEVR datasets, showing competitive performance for generation of 3D scenes and inference of 3D scenes from 2D images. Additionally, our diffusion-based approach allows us to use 2D inpainting to edit 3D scenes.
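To make the central idea concrete, the following is a minimal sketch of a denoising step that builds a 3D representation from the noisy image and renders it to obtain the denoised image. It is an illustration under stated assumptions, not the architecture from the paper: the voxel grid, the linear `encode_to_volume` encoder with random weights, the orthographic `render` function, and the DDIM-style deterministic update in `sample` are all hypothetical stand-ins chosen for brevity.

```python
import numpy as np

GRID = 8  # toy resolution (assumption): GRID^3 voxels, GRID x GRID images


def encode_to_volume(noisy_image, t_frac, weights):
    """Hypothetical stand-in for a learned encoder: noisy image -> voxel density and color."""
    feats = np.concatenate([noisy_image.reshape(-1), [t_frac]])
    vol = (weights @ feats).reshape(GRID, GRID, GRID, 4)
    density = np.logaddexp(0.0, vol[..., 0])     # softplus keeps density non-negative
    color = 1.0 / (1.0 + np.exp(-vol[..., 1:]))  # sigmoid keeps RGB in (0, 1)
    return density, color


def render(density, color, axis=0, step=1.0):
    """Orthographic emission-absorption rendering along one grid axis (assumed camera model)."""
    density = np.moveaxis(density, axis, 0)      # (D, H, W)
    color = np.moveaxis(color, axis, 0)          # (D, H, W, 3)
    alpha = 1.0 - np.exp(-density * step)        # opacity of each sample
    trans = np.cumprod(1.0 - alpha, axis=0)      # light surviving past each sample
    trans = np.concatenate([np.ones_like(trans[:1]), trans[:-1]], axis=0)
    weights = trans * alpha                      # contribution of each sample to the pixel
    return (weights[..., None] * color).sum(axis=0)


def sample(weights, steps=10, seed=0):
    """Reverse diffusion in which every step predicts the clean image by rendering a volume."""
    rng = np.random.default_rng(seed)
    alpha_bar = np.linspace(0.999, 0.01, steps)  # assumed noise schedule
    x = rng.standard_normal((GRID, GRID, 3))     # start from pure noise
    for t in range(steps - 1, -1, -1):
        density, color = encode_to_volume(x, t / steps, weights)
        x0_hat = render(density, color, axis=0)  # denoised estimate = rendered 3D scene
        eps_hat = (x - np.sqrt(alpha_bar[t]) * x0_hat) / np.sqrt(1.0 - alpha_bar[t])
        a_prev = alpha_bar[t - 1] if t > 0 else 1.0
        # DDIM-style deterministic update (a swapped-in sampler, not confirmed by the source)
        x = np.sqrt(a_prev) * x0_hat + np.sqrt(1.0 - a_prev) * eps_hat
    return x, (density, color)
```

Because the last step returns the rendered volume itself, the same `density` and `color` grids can be passed to `render` with a different `axis` to view the generated scene from another direction, which mirrors the claim that the 3D representation can be rendered from any view. With untrained random weights the output is meaningless; the sketch only shows how rendering sits inside the denoising loop.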