We present a method to map 2D image observations of a scene to a persistent 3D scene representation, enabling novel view synthesis and disentangled representation of the movable and immovable components of the scene. Motivated by the bird's-eye-view (BEV) representation commonly used in vision and robotics, we propose conditional neural groundplans, ground-aligned 2D feature grids, as persistent and memory-efficient scene representations. Our method is trained self-supervised from unlabeled multi-view observations using differentiable rendering, and learns to complete geometry and appearance of occluded regions. In addition, we show that we can leverage multi-view videos at training time to learn to separately reconstruct static and movable components of the scene from a single image at test time. The ability to separately reconstruct movable objects enables a variety of downstream tasks using simple heuristics, such as extraction of object-centric 3D representations, novel view synthesis, instance-level segmentation, 3D bounding box prediction, and scene editing. This highlights the value of neural groundplans as a backbone for efficient 3D scene understanding models.
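To make the core representation concrete, the following is a minimal PyTorch sketch, not the authors' implementation, of a ground-aligned 2D feature grid ("groundplan") queried at 3D points: features are bilinearly sampled at each point's ground-plane projection and decoded, together with the query height, into density and color suitable for differentiable volume rendering along camera rays. The grid resolution, metric extent, feature width, decoder architecture, and the use of a learnable grid in place of an image-conditioned encoder are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

# Minimal sketch (not the paper's code): a ground-aligned 2D feature grid
# queried at 3D points and decoded into density + RGB for volume rendering.
# Grid size, extent, feature width, and decoder are illustrative assumptions;
# in the described method the grid would be predicted from image observations
# rather than stored as a free parameter.

class Groundplan(torch.nn.Module):
    def __init__(self, grid_size=128, feat_dim=64, extent=10.0):
        super().__init__()
        # Persistent 2D feature grid aligned with the ground (x-z) plane.
        self.features = torch.nn.Parameter(
            torch.zeros(1, feat_dim, grid_size, grid_size))
        self.extent = extent  # metric half-width of the scene covered by the grid
        # Small MLP decoding a sampled feature plus the query height
        # into (density, r, g, b).
        self.decoder = torch.nn.Sequential(
            torch.nn.Linear(feat_dim + 1, 128), torch.nn.ReLU(),
            torch.nn.Linear(128, 4),
        )

    def forward(self, xyz):
        # xyz: (N, 3) query points along camera rays, in world coordinates.
        # Project onto the ground plane and normalize to [-1, 1] for grid_sample.
        uv = xyz[:, [0, 2]] / self.extent
        grid = uv.view(1, -1, 1, 2)
        feats = F.grid_sample(self.features, grid, align_corners=True)  # (1, C, N, 1)
        feats = feats.squeeze(0).squeeze(-1).t()                        # (N, C)
        # Height above ground stays an explicit input to the decoder.
        out = self.decoder(torch.cat([feats, xyz[:, 1:2]], dim=-1))
        density = F.softplus(out[:, :1])
        rgb = torch.sigmoid(out[:, 1:])
        return density, rgb
```

Densities and colors produced this way would then be composited with a standard volume-rendering weighting to form pixels, which is what allows the representation to be trained from posed multi-view images without 3D supervision.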