Human perception reliably identifies movable and immovable parts of 3D scenes, and completes the 3D structure of objects and background from incomplete observations. We learn this skill not via labeled examples, but simply by observing objects move. In this work, we propose an approach that observes unlabeled multi-view videos at training time and learns to map a single image observation of a complex scene, such as a street with cars, to a 3D neural scene representation that is disentangled into movable and immovable parts while plausibly completing its 3D structure. We separately parameterize movable and immovable scene parts via 2D neural ground plans. These ground plans are 2D grids of features aligned with the ground plane that can be locally decoded into 3D neural radiance fields. Our model is trained in a self-supervised manner via neural rendering. We demonstrate that the structure inherent to our disentangled 3D representation enables a variety of downstream tasks in street-scale 3D scenes using simple heuristics, including extraction of object-centric 3D representations, novel view synthesis, instance segmentation, and 3D bounding box prediction, highlighting its value as a backbone for data-efficient 3D scene understanding models. This disentanglement further enables scene editing via object manipulation such as deletion, insertion, and rigid-body motion.
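To make the ground-plan parameterization concrete, the following is a minimal PyTorch-style sketch of how a 2D grid of features aligned with the ground plane could be locally decoded into a neural radiance field (density and color at a 3D query point). All class names, grid sizes, feature dimensions, and network widths here are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GroundPlanRadianceField(nn.Module):
    """Decode a 2D ground-plan feature grid into density and color at 3D points.

    Hypothetical sketch: feature_dim, plan_size, plan_extent, and the decoder
    architecture are illustrative choices, not values from the paper.
    """

    def __init__(self, feature_dim=64, plan_size=128, plan_extent=50.0):
        super().__init__()
        # 2D grid of features aligned with the ground (x-z) plane, covering
        # [-plan_extent, plan_extent] metres along each horizontal axis.
        self.plan = nn.Parameter(torch.zeros(1, feature_dim, plan_size, plan_size))
        self.plan_extent = plan_extent
        # Local decoder: ground-plan feature + height above ground -> (density, RGB).
        self.decoder = nn.Sequential(
            nn.Linear(feature_dim + 1, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 4),  # 1 density channel + 3 color channels
        )

    def forward(self, points):
        # points: (N, 3) world coordinates with y as the up axis.
        xz = points[:, [0, 2]] / self.plan_extent      # normalize to [-1, 1]
        grid = xz.view(1, -1, 1, 2)                    # grid_sample locations
        feats = F.grid_sample(self.plan, grid, align_corners=True)
        feats = feats.squeeze(-1).squeeze(0).t()       # (N, feature_dim)
        height = points[:, 1:2]                        # height above ground plane
        out = self.decoder(torch.cat([feats, height], dim=-1))
        density = F.softplus(out[:, :1])
        color = torch.sigmoid(out[:, 1:])
        return density, color
```

In the full model, separate ground plans for movable and immovable scene parts would be predicted from the input image by an encoder rather than stored as a free parameter; this sketch only illustrates the local decoding step that maps ground-plan features to radiance-field quantities for volume rendering.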