We propose Panoptic Lifting, a novel approach for learning panoptic 3D volumetric representations from images of in-the-wild scenes. Once trained, our model can render color images together with 3D-consistent panoptic segmentation from novel viewpoints. Unlike existing approaches which use 3D input directly or indirectly, our method requires only machine-generated 2D panoptic segmentation masks inferred from a pre-trained network. Our core contribution is a panoptic lifting scheme based on a neural field representation that generates a unified and multi-view consistent, 3D panoptic representation of the scene. To account for inconsistencies of 2D instance identifiers across views, we solve a linear assignment with a cost based on the model's current predictions and the machine-generated segmentation masks, thus enabling us to lift 2D instances to 3D in a consistent way. We further propose and ablate contributions that make our method more robust to noisy, machine-generated labels, including test-time augmentations for confidence estimates, segment consistency loss, bounded segmentation fields, and gradient stopping. Experimental results validate our approach on the challenging Hypersim, Replica, and ScanNet datasets, improving over the state of the art by 8.4%, 13.8%, and 10.6% in scene-level PQ, respectively.
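The linear-assignment step described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes an IoU-based matching cost between the model's rendered instance masks and the per-frame machine-generated masks, and the function name `match_instances` is hypothetical.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_instances(pred_masks, seg_masks):
    """Match rendered instance masks to a frame's machine-generated 2D
    masks by solving a linear assignment (Hungarian method).

    pred_masks: (P, H, W) boolean array of the model's current predictions.
    seg_masks:  (G, H, W) boolean array of machine-generated segments.
    Returns a list of (pred_index, seg_index) pairs.
    """
    P, G = len(pred_masks), len(seg_masks)
    cost = np.zeros((P, G))
    for i in range(P):
        for j in range(G):
            inter = np.logical_and(pred_masks[i], seg_masks[j]).sum()
            union = np.logical_or(pred_masks[i], seg_masks[j]).sum()
            # Negate IoU: linear_sum_assignment minimizes total cost.
            cost[i, j] = -(inter / union) if union else 0.0
    rows, cols = linear_sum_assignment(cost)
    return list(zip(rows.tolist(), cols.tolist()))
```

Applying the assignment per frame yields a consistent mapping from view-dependent 2D instance identifiers to the persistent 3D instances of the field.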