In this paper, we introduce a novel 3D-aware image generation method that leverages 2D diffusion models. We formulate 3D-aware image generation as multiview 2D image set generation, and further decompose it into a sequential process of unconditional and conditional multiview image generation, which allows us to harness the generative modeling power of 2D diffusion models. Additionally, we incorporate depth information from monocular depth estimators to construct the training data for the conditional diffusion model using only still images. We train our method on a large-scale dataset, ImageNet, a setting not addressed by previous 3D-aware generation methods, and it produces high-quality images that significantly outperform those of prior approaches. Furthermore, our approach can generate instances under large view angles, even though the training images are diverse and unaligned, gathered from "in-the-wild" real-world environments.
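To make the sequential unconditional-conditional formulation concrete, the following is a minimal Python sketch of the generation loop and of one plausible depth-based training-pair construction. All function names (`unconditional_diffusion_sample`, `estimate_depth`, `warp_to_view`, `conditional_diffusion_sample`) are hypothetical stand-ins rather than the paper's actual implementation, and the stub bodies only mimic the data flow under these assumptions.

```python
import numpy as np

def unconditional_diffusion_sample(shape):
    # Stand-in for a pretrained 2D diffusion model's unconditional sampler
    # (hypothetical; a real implementation would run the reverse diffusion process).
    return np.random.rand(*shape)

def estimate_depth(image):
    # Stand-in for an off-the-shelf monocular depth estimator.
    return np.random.rand(*image.shape[:2])

def warp_to_view(image, depth, pose):
    # Stand-in for depth-based reprojection of `image` into the camera `pose`;
    # real code would forward-warp pixels using `depth`, leaving disocclusion holes.
    return image

def conditional_diffusion_sample(condition, shape):
    # Stand-in for the conditional diffusion model, which completes and refines
    # the warped view so it stays consistent with the conditioning image.
    return condition

def generate_multiview(poses, image_shape=(256, 256, 3)):
    """Sequential unconditional-then-conditional multiview generation."""
    views = [unconditional_diffusion_sample(image_shape)]  # first view: unconditional
    for pose in poses[1:]:
        depth = estimate_depth(views[-1])                  # depth for warping
        warped = warp_to_view(views[-1], depth, pose)      # partial next view
        views.append(conditional_diffusion_sample(warped, image_shape))
    return views

def make_training_pair(still_image):
    # One plausible construction from a single still image (an assumption, not
    # necessarily the paper's exact recipe): warp the image to a random novel
    # view and back so that disoccluded regions become holes, then train the
    # conditional model to recover the original image from this condition.
    depth = estimate_depth(still_image)
    novel_pose = np.random.standard_normal(6)              # hypothetical 6-DoF offset
    warped = warp_to_view(still_image, depth, novel_pose)
    condition = warp_to_view(warped, estimate_depth(warped), -novel_pose)
    return condition, still_image                          # (input, target) pair

if __name__ == "__main__":
    views = generate_multiview(poses=[np.zeros(6), np.ones(6), 2 * np.ones(6)])
    print(f"generated {len(views)} views of one instance")
```

Note that only still images and estimated depth enter `make_training_pair`, which is what allows training on a diverse, unaligned collection such as ImageNet without any multiview supervision.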