Emerging neural radiance fields (NeRF) are a promising scene representation for computer graphics, enabling high-quality 3D reconstruction and novel view synthesis from image observations. However, editing a scene represented by a NeRF is challenging, as the underlying connectionist representations such as MLPs or voxel grids are not object-centric or compositional. In particular, it has been difficult to selectively edit specific regions or objects. In this work, we tackle the problem of semantic scene decomposition of NeRFs to enable query-based local editing of the represented 3D scenes. We propose to distill the knowledge of off-the-shelf, self-supervised 2D image feature extractors such as CLIP-LSeg or DINO into a 3D feature field optimized in parallel to the radiance field. Given a user-specified query of various modalities such as text, an image patch, or a point-and-click selection, 3D feature fields semantically decompose 3D space without the need for re-training and enable us to semantically select and edit regions in the radiance field. Our experiments validate that the distilled feature fields (DFFs) can transfer recent progress in 2D vision and language foundation models to 3D scene representations, enabling convincing 3D segmentation and selective editing of emerging neural graphics representations.
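The core idea — volume-rendering a feature field along camera rays and supervising it with 2D teacher features — can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the L1 distillation loss, and the numpy setting are all assumptions for clarity; the actual method optimizes the feature field jointly with the radiance field using density from the NeRF.

```python
import numpy as np

def composite_along_ray(sigmas, feats, deltas):
    # Standard volume-rendering weights: w_i = T_i * (1 - exp(-sigma_i * delta_i)),
    # where T_i is the accumulated transmittance up to sample i.
    alphas = 1.0 - np.exp(-sigmas * deltas)
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alphas[:-1]]))
    weights = trans * alphas
    # Composite per-sample feature vectors into one feature per ray.
    return weights @ feats  # (D,) @ (D, C) -> (C,)

def distillation_loss(rendered_feat, teacher_feat):
    # Penalize the distance between the rendered 3D feature and the
    # 2D teacher feature (e.g. from CLIP-LSeg or DINO) at that pixel.
    return np.abs(rendered_feat - teacher_feat).mean()

# Toy example: 3 samples along a ray, 3-dimensional features.
sigmas = np.array([1.0, 2.0, 0.5])   # densities from the radiance field
deltas = np.full(3, 0.1)             # distances between samples
feats = np.eye(3)                    # per-sample feature vectors (D, C)
rendered = composite_along_ray(sigmas, feats, deltas)
```

At query time, selection works by comparing the rendered (or per-point) features against an embedded query (text, patch, or click) with a similarity measure such as cosine similarity, which is what allows decomposition without re-training.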