While recent work on text-conditional 3D object generation has shown promising results, the state-of-the-art methods typically require multiple GPU-hours to produce a single sample. This is in stark contrast to state-of-the-art generative image models, which produce samples in a number of seconds or minutes. In this paper, we explore an alternative method for 3D object generation which produces 3D models in only 1-2 minutes on a single GPU. Our method first generates a single synthetic view using a text-to-image diffusion model, and then produces a 3D point cloud using a second diffusion model which conditions on the generated image. While our method still falls short of the state-of-the-art in terms of sample quality, it is one to two orders of magnitude faster to sample from, offering a practical trade-off for some use cases. We release our pre-trained point cloud diffusion models, as well as evaluation code and models, at https://github.com/openai/point-e.
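For readers who want to try the released checkpoints, below is a minimal sketch of the second stage of the pipeline: sampling a point cloud conditioned on a single image, followed by an upsampling step. The module paths, checkpoint names ('base40M', 'upsample'), and PointCloudSampler arguments are assumptions taken from the example notebooks in the point-e repository and may differ across versions; the input image stands in for a synthetic view produced by a separate text-to-image diffusion model.

```python
# Minimal sketch of image-conditioned point cloud sampling with the released
# point-e models. Names below (MODEL_CONFIGS, PointCloudSampler, checkpoint
# names such as 'base40M' and 'upsample') are assumed from the repository's
# example notebooks and may change between versions.
from PIL import Image
import torch
from tqdm.auto import tqdm

from point_e.diffusion.configs import DIFFUSION_CONFIGS, diffusion_from_config
from point_e.diffusion.sampler import PointCloudSampler
from point_e.models.configs import MODEL_CONFIGS, model_from_config
from point_e.models.download import load_checkpoint

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Base model: denoises a coarse 1K-point cloud conditioned on a single image.
base_name = 'base40M'  # larger checkpoints trade sampling speed for quality
base_model = model_from_config(MODEL_CONFIGS[base_name], device)
base_model.eval()
base_model.load_state_dict(load_checkpoint(base_name, device))
base_diffusion = diffusion_from_config(DIFFUSION_CONFIGS[base_name])

# Upsampler: refines the coarse cloud from 1K to 4K points.
upsampler_model = model_from_config(MODEL_CONFIGS['upsample'], device)
upsampler_model.eval()
upsampler_model.load_state_dict(load_checkpoint('upsample', device))
upsampler_diffusion = diffusion_from_config(DIFFUSION_CONFIGS['upsample'])

sampler = PointCloudSampler(
    device=device,
    models=[base_model, upsampler_model],
    diffusions=[base_diffusion, upsampler_diffusion],
    num_points=[1024, 4096 - 1024],
    aux_channels=['R', 'G', 'B'],
    guidance_scale=[3.0, 3.0],
)

# 'view.png' is a hypothetical placeholder for the synthetic view generated by
# the text-to-image stage of the pipeline.
img = Image.open('view.png')
samples = None
for x in tqdm(sampler.sample_batch_progressive(batch_size=1,
                                               model_kwargs=dict(images=[img]))):
    samples = x  # keep the final denoised batch

pc = sampler.output_to_point_clouds(samples)[0]  # RGB point cloud, 4096 points
```

The first stage (generating the conditioning view from text) runs separately; only its output image is passed to the point cloud sampler, which is what keeps end-to-end generation in the 1-2 minute range on a single GPU.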