Novel view synthesis from a single image requires inferring occluded regions of objects and scenes whilst simultaneously maintaining semantic and physical consistency with the input. Existing approaches condition neural radiance fields (NeRF) on local image features, projecting points to the input image plane, and aggregating 2D features to perform volume rendering. However, under severe occlusion, this projection fails to resolve uncertainty, resulting in blurry renderings that lack details. In this work, we propose NerfDiff, which addresses this issue by distilling the knowledge of a 3D-aware conditional diffusion model (CDM) into NeRF through synthesizing and refining a set of virtual views at test time. We further propose a novel NeRF-guided distillation algorithm that simultaneously generates 3D consistent virtual views from the CDM samples, and finetunes the NeRF based on the improved virtual views. Our approach significantly outperforms existing NeRF-based and geometry-free approaches on challenging datasets, including ShapeNet, ABO, and Clevr3D.
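To make the described procedure concrete, the sketch below illustrates the alternation the abstract refers to: render virtual views with the image-conditioned NeRF, use those renderings to guide the CDM's sampling toward 3D-consistent refined views, then finetune the NeRF on the refined views. This is a minimal illustrative sketch, not the authors' released code; `nerf.render`, `cdm.sample`, `virtual_poses`, and all hyperparameters are hypothetical placeholders assumed for exposition.

```python
import torch

def nerf_guided_distillation(nerf, cdm, input_image, virtual_poses,
                             num_rounds=3, finetune_steps=100, lr=1e-4):
    """Hypothetical sketch of NeRF-guided distillation at test time.

    Alternates between (1) refining CDM samples guided by the current NeRF
    renderings and (2) finetuning the NeRF on the refined virtual views.
    `nerf` and `cdm` are assumed PyTorch modules with the interfaces used below.
    """
    optimizer = torch.optim.Adam(nerf.parameters(), lr=lr)
    for _ in range(num_rounds):
        # 1) Render each virtual view with the current image-conditioned NeRF.
        with torch.no_grad():
            rendered = [nerf.render(pose) for pose in virtual_poses]

        # 2) Guide the CDM's reverse diffusion with the (possibly blurry but
        #    3D-consistent) NeRF renderings to obtain sharper virtual views.
        with torch.no_grad():
            refined = [cdm.sample(input_image, pose, guidance=img)
                       for pose, img in zip(virtual_poses, rendered)]

        # 3) Finetune the NeRF on the refined virtual views.
        for _ in range(finetune_steps):
            optimizer.zero_grad()
            loss = sum(torch.nn.functional.mse_loss(nerf.render(pose), target)
                       for pose, target in zip(virtual_poses, refined))
            loss.backward()
            optimizer.step()
    return nerf
```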