Reconstructing the 3D shape of an object from a single RGB image is a long-standing and highly challenging problem in computer vision. In this paper, we propose a novel method for single-image 3D reconstruction which generates a sparse point cloud via a conditional denoising diffusion process. Our method takes as input a single RGB image along with its camera pose and gradually denoises a set of 3D points, whose positions are initially sampled randomly from a three-dimensional Gaussian distribution, into the shape of an object. The key to our method is a geometrically consistent conditioning process which we call projection conditioning: at each step in the diffusion process, we project local image features onto the partially denoised point cloud from the given camera pose. This projection conditioning process enables us to generate high-resolution sparse geometries that are well-aligned with the input image, and can additionally be used to predict point colors after shape reconstruction. Moreover, due to the probabilistic nature of the diffusion process, our method is naturally capable of generating multiple different shapes consistent with a single input image. In contrast to prior work, our approach not only performs well on synthetic benchmarks, but also gives large qualitative improvements on complex real-world data.
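To make the idea of projection conditioning concrete, the following is a minimal NumPy sketch, not the implementation used in this work. It assumes a pinhole camera with intrinsics `K` and a world-to-camera pose `(R, t)`, samples features by nearest pixel, ignores occlusion, and assigns zero features to points that fall outside the image or behind the camera. The names `project_features`, `sample`, and `denoiser` are hypothetical; `denoiser` stands in for a learned network together with the sampler's update rule.

```python
import numpy as np

def project_features(points, feats, K, R, t):
    """Attach image features to 3D points via pinhole projection.

    points: (N, 3) world-space points (e.g. a partially denoised cloud)
    feats:  (H, W, C) image feature map
    K:      (3, 3) intrinsics; R: (3, 3), t: (3,) world-to-camera pose
    Returns an (N, 3 + C) array: each point concatenated with its feature.
    """
    H, W, C = feats.shape
    cam = points @ R.T + t                      # world -> camera coordinates
    z = cam[:, 2]
    in_front = z > 1e-6                         # points behind the camera get no feature
    safe_z = np.where(in_front, z, 1.0)         # avoid division by zero
    uvw = cam @ K.T
    col = np.round(uvw[:, 0] / safe_z).astype(int)   # pixel column (u)
    row = np.round(uvw[:, 1] / safe_z).astype(int)   # pixel row (v)
    valid = in_front & (col >= 0) & (col < W) & (row >= 0) & (row < H)
    out = np.zeros((points.shape[0], C), dtype=feats.dtype)
    out[valid] = feats[row[valid], col[valid]]  # nearest-pixel lookup (assumption)
    return np.concatenate([points, out], axis=1)

def sample(feats, K, R, t, denoiser, n_points=1024, n_steps=10, seed=0):
    """Toy reverse process: features are re-projected at every step."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n_points, 3))      # initial Gaussian point positions
    for step in reversed(range(n_steps)):
        cond = project_features(x, feats, K, R, t)
        x = denoiser(cond, step)                # hypothetical: returns (N, 3) points
    return x
```

Because the projection is recomputed from the current point positions at each step, the features a point receives change as the cloud moves toward the object's shape, which is what ties the generated geometry to the input view.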