Obtaining 3D object representations is important for creating photo-realistic simulators and collecting assets for AR/VR applications. Neural fields have proven effective at learning a continuous volumetric representation of a scene from 2D images, but acquiring object representations from these models with weak supervision remains an open challenge. In this paper, we introduce LaTeRF, a method for extracting an object of interest from a scene given 2D images of the entire scene with known camera poses, a natural language description of the object, and a small number of labeled object and non-object points in the input images. To faithfully extract the object from the scene, LaTeRF extends the NeRF formulation with an additional `objectness' probability at each 3D point. Additionally, we leverage the rich latent space of a pre-trained CLIP model, combined with our differentiable object renderer, to inpaint the occluded parts of the object. We demonstrate high-fidelity object extraction on both synthetic and real datasets and justify our design choices through an extensive ablation study.
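To make the objectness idea concrete, the sketch below shows one plausible way a per-point objectness probability could gate standard NeRF volume rendering so that only the object of interest contributes to the rendered image. This is a minimal, hypothetical PyTorch illustration: the function and argument names (`render_object`, `objectness_logits`, `deltas`) are ours and are not taken from the paper, and the exact way LaTeRF combines objectness with density may differ.

```python
import torch

def render_object(rgb, sigma, objectness_logits, deltas):
    """Volume-render only the object by gating density with a per-point
    'objectness' probability (hypothetical sketch, not the paper's code).

    rgb:               (num_rays, num_samples, 3) predicted colors
    sigma:             (num_rays, num_samples) predicted densities
    objectness_logits: (num_rays, num_samples) extra per-point output head
    deltas:            (num_rays, num_samples) distances between samples
    """
    p_obj = torch.sigmoid(objectness_logits)      # objectness probability in [0, 1]
    sigma_obj = sigma * p_obj                     # suppress density of non-object points

    # Standard NeRF alpha compositing, applied to the gated density.
    alpha = 1.0 - torch.exp(-sigma_obj * deltas)
    trans = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=-1),
        dim=-1,
    )[:, :-1]                                     # accumulated transmittance per sample
    weights = alpha * trans

    # Composite colors along each ray -> object-only image, (num_rays, 3).
    return (weights[..., None] * rgb).sum(dim=-2)
```

In the same spirit, because this renderer is differentiable, the rendered object-only image can be encoded with a frozen CLIP image encoder and compared against the embedding of the natural language description, providing a gradient signal that encourages plausible completion of the object's occluded parts.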