Diffusion models have emerged as the best approach for generative modeling of 2D images. Part of their success stems from the possibility of training them on millions, if not billions, of images with a stable learning objective. However, extending these models to 3D remains difficult for two reasons. First, finding a large quantity of 3D training data is much harder than for 2D images. Second, while it is conceptually trivial to extend the models to operate on 3D rather than 2D grids, the associated cubic growth in memory and compute complexity makes this infeasible. We address the first challenge by introducing a new diffusion setup that can be trained, end-to-end, with only posed 2D images for supervision; and the second challenge by proposing an image formation model that decouples model memory from spatial memory. We evaluate our method on real-world data, using the CO3D dataset, which has not been used to train 3D generative models before. We show that our diffusion models are scalable, train robustly, and are competitive with existing approaches to 3D generative modeling in terms of sample quality and fidelity.
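To make the training setup concrete, the following is a minimal sketch, under stated assumptions, of how a diffusion step on a 3D representation can be supervised with only posed 2D images: a noisy 3D feature volume is denoised, differentiably rendered to the training cameras, and penalized with a photometric loss. The names `denoiser`, `render_views`, and `encode_volume` are hypothetical placeholders, not the authors' actual modules, and the noise schedule is a toy choice for illustration.

```python
import torch

def training_step(denoiser, render_views, encode_volume,
                  images, cameras, optimizer, num_timesteps=1000):
    """One illustrative training step: 2D-posed-image supervision of a 3D diffusion model."""
    # Build a clean 3D feature volume from the posed views (hypothetical encoder).
    x0 = encode_volume(images, cameras)                      # (B, C, D, H, W)

    # DDPM-style forward process: corrupt the volume with Gaussian noise at a random timestep.
    t = torch.randint(0, num_timesteps, (x0.shape[0],), device=x0.device)
    alpha_bar = torch.cos(0.5 * torch.pi * t.float() / num_timesteps) ** 2  # toy schedule
    a = alpha_bar.view(-1, 1, 1, 1, 1)
    noise = torch.randn_like(x0)
    xt = a.sqrt() * x0 + (1 - a).sqrt() * noise

    # Denoise the noisy volume, then render it to the training cameras so the
    # loss is computed purely in 2D image space; no 3D ground truth is required.
    x0_pred = denoiser(xt, t)
    rendered = render_views(x0_pred, cameras)                 # (B, V, 3, H_img, W_img)
    loss = torch.nn.functional.mse_loss(rendered, images)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Note that because supervision flows only through the rendered 2D views, the sketch reflects the abstract's claim that no 3D ground-truth data is needed; the memory/compute decoupling claimed for the image formation model is not captured here.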