Diffusion models currently achieve state-of-the-art performance for both conditional and unconditional image generation. However, so far, image diffusion models do not support tasks required for 3D understanding, such as view-consistent 3D generation or single-view object reconstruction. In this paper, we present RenderDiffusion as the first diffusion model for 3D generation and inference that can be trained using only monocular 2D supervision. At the heart of our method is a novel image denoising architecture that generates and renders an intermediate three-dimensional representation of a scene in each denoising step. This enforces a strong inductive structure within the diffusion process, providing a 3D-consistent representation while only requiring 2D supervision. The resulting 3D representation can be rendered from any viewpoint. We evaluate RenderDiffusion on the ShapeNet and CLEVR datasets and show competitive performance for generation of 3D scenes and inference of 3D scenes from 2D images. Additionally, our diffusion-based approach allows us to use 2D inpainting to edit 3D scenes. We believe that our work promises to enable full 3D generation at scale when trained on massive image collections, thus circumventing the need to have large-scale 3D model collections for supervision.
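To make the core idea concrete, below is a minimal, hypothetical PyTorch sketch (not the authors' implementation) of such a denoising step: the noisy image x_t is lifted to an intermediate 3D scene representation, which is then rendered back to image space from a given viewpoint, so supervision can stay entirely in 2D. The module names (SceneEncoder, Renderer, RenderDenoiser), tensor shapes, and the stand-in encoder/renderer are placeholder assumptions.

    import torch
    import torch.nn as nn

    class SceneEncoder(nn.Module):
        """Placeholder: maps a noisy image and timestep to a latent 3D scene representation."""
        def __init__(self, channels=3, feat=16):
            super().__init__()
            self.net = nn.Conv2d(channels + 1, feat, kernel_size=3, padding=1)

        def forward(self, x_t, t):
            # Broadcast the timestep as an extra channel; a real model would use a timestep embedding.
            t_map = t.view(-1, 1, 1, 1).expand(-1, 1, *x_t.shape[-2:]).float()
            return self.net(torch.cat([x_t, t_map], dim=1))  # stand-in for a 3D representation

    class Renderer(nn.Module):
        """Placeholder: renders the 3D representation back to an image for a given viewpoint."""
        def __init__(self, feat=16, channels=3):
            super().__init__()
            self.net = nn.Conv2d(feat, channels, kernel_size=3, padding=1)

        def forward(self, scene, camera):
            # A real volume renderer would project the 3D representation using `camera`.
            return self.net(scene)

    class RenderDenoiser(nn.Module):
        """One denoising step: noisy image -> intermediate 3D representation -> rendered denoised image."""
        def __init__(self):
            super().__init__()
            self.encoder, self.renderer = SceneEncoder(), Renderer()

        def forward(self, x_t, t, camera):
            scene = self.encoder(x_t, t)             # intermediate 3D scene representation
            x0_pred = self.renderer(scene, camera)   # compared against 2D images only
            return x0_pred, scene

    # Usage sketch with hypothetical shapes.
    x_t = torch.randn(2, 3, 64, 64)        # noisy images at timestep t
    t = torch.tensor([10, 10])             # diffusion timesteps
    camera = torch.eye(4).repeat(2, 1, 1)  # placeholder camera poses
    x0_pred, scene = RenderDenoiser()(x_t, t, camera)

Because the denoised image is produced by rendering the intermediate 3D representation, the same representation can also be rendered from other viewpoints, which is what gives the method its view consistency under purely 2D supervision.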