2D-to-3D reconstruction is an ill-posed problem, yet humans solve it well thanks to prior knowledge of the 3D world developed over years of experience. Driven by this observation, we propose NeRDi, a single-view NeRF synthesis framework with general image priors from 2D diffusion models. Formulating single-view reconstruction as an image-conditioned 3D generation problem, we optimize the NeRF representation by minimizing a diffusion loss on its arbitrary-view renderings with a pretrained image diffusion model, under the input-view constraint. We leverage off-the-shelf vision-language models and introduce a two-section language guidance as conditioning input to the diffusion model. This is especially helpful for improving multiview content coherence, as it narrows down the general image prior conditioned on the semantic and visual features of the single-view input image. Additionally, we introduce a geometric loss based on estimated depth maps to regularize the underlying 3D geometry of the NeRF. Experimental results on the DTU MVS dataset show that our method synthesizes novel views of higher quality even compared to existing methods trained on this dataset. We also demonstrate the generalizability of our method through zero-shot NeRF synthesis on in-the-wild images.
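To make the loss structure described above concrete, here is a minimal toy sketch in plain Python. It is not the paper's implementation: the function names and weights are hypothetical, and each term is replaced by a mean-squared error stand-in (in NeRDi, the first term would instead come from a pretrained 2D diffusion model scoring an arbitrary-view rendering).

```python
def mse(a, b):
    """Mean-squared error between two equal-length flat lists of floats."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def nerdi_style_loss(novel_render, diffusion_target,
                     input_render, input_image,
                     rendered_depth, estimated_depth,
                     w_input=1.0, w_depth=0.1):
    """Hypothetical combination of the three terms in the abstract:
    (1) a diffusion prior on novel-view renderings (MSE stand-in here),
    (2) the input-view reconstruction constraint, and
    (3) a geometric regularizer on estimated depth maps.
    Weights w_input / w_depth are illustrative, not from the paper."""
    l_diffusion = mse(novel_render, diffusion_target)   # stand-in for the diffusion loss
    l_input = mse(input_render, input_image)            # input-view constraint
    l_depth = mse(rendered_depth, estimated_depth)      # depth-based geometric loss
    return l_diffusion + w_input * l_input + w_depth * l_depth
```

In an actual optimization loop, a NeRF would be rendered from a randomly sampled camera pose each iteration, and gradients of this combined objective would update the NeRF parameters.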