Most deep learning approaches to comprehensive semantic modeling of 3D indoor spaces require costly dense annotations in the 3D domain. In this work, we explore a central 3D scene modeling task, namely semantic scene reconstruction, using a fully self-supervised approach. To this end, we design a trainable model that takes both incomplete 3D reconstructions and their corresponding source RGB-D images, fusing cross-domain features into volumetric embeddings to predict complete 3D geometry, color, and semantics. Our key technical innovation is to leverage differentiable rendering of color and semantics, using the observed RGB images and a generic semantic segmentation model as color and semantics supervision, respectively. We additionally develop a method to synthesize an augmented set of virtual training views that complement the original real captures, enabling more efficient self-supervision for semantics. The result is an end-to-end trainable solution that jointly addresses geometry completion, colorization, and semantic mapping from only a few RGB-D images, without any 3D or 2D ground truth. To our knowledge, ours is the first fully self-supervised method to address both completion and semantic segmentation of real-world 3D scans. It performs on par with 3D-supervised baselines, surpasses baselines with 2D supervision on real datasets, and generalizes well to unseen scenes.
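To make the supervision scheme concrete, the following is a minimal sketch, not the authors' implementation, of how differentiable volume rendering can turn observed RGB images and 2D pseudo-labels from a generic segmentation network into training losses for predicted color and semantic fields. It assumes standard alpha compositing along camera rays in PyTorch; all function and variable names (render_volume, self_supervised_losses) are hypothetical.

```python
# Sketch of self-supervision via differentiable rendering (hypothetical code).
# Predicted volumetric color/semantic fields are rendered into a camera view
# and compared against the captured RGB image and 2D segmentation pseudo-labels.
import torch
import torch.nn.functional as F

def render_volume(densities, features, depths):
    """Differentiably composite per-sample features along each ray.

    densities: (R, S)    non-negative density at S samples per ray
    features:  (R, S, C) per-sample feature (RGB or semantic logits)
    depths:    (R, S)    sample depths along each ray
    Returns (R, C): rendered feature per ray via standard alpha compositing.
    """
    deltas = depths[:, 1:] - depths[:, :-1]
    deltas = torch.cat([deltas, deltas[:, -1:]], dim=1)
    alpha = 1.0 - torch.exp(-densities * deltas)                   # (R, S)
    transmittance = torch.cumprod(
        torch.cat([torch.ones_like(alpha[:, :1]), 1.0 - alpha + 1e-10], dim=1),
        dim=1)[:, :-1]                                             # (R, S)
    weights = alpha * transmittance
    return (weights.unsqueeze(-1) * features).sum(dim=1)           # (R, C)

def self_supervised_losses(densities, colors, sem_logits, depths,
                           observed_rgb, pseudo_labels):
    """Photometric loss against the observed RGB image, plus a cross-entropy
    loss against per-pixel pseudo-labels from a generic 2D segmentation model."""
    rendered_rgb = render_volume(densities, colors, depths)        # (R, 3)
    rendered_sem = render_volume(densities, sem_logits, depths)    # (R, K)
    color_loss = F.l1_loss(rendered_rgb, observed_rgb)
    semantic_loss = F.cross_entropy(rendered_sem, pseudo_labels)
    return color_loss, semantic_loss

# Toy usage: R rays, S samples per ray, K semantic classes.
R, S, K = 1024, 64, 21
losses = self_supervised_losses(
    densities=torch.rand(R, S),
    colors=torch.rand(R, S, 3),
    sem_logits=torch.randn(R, S, K),
    depths=torch.linspace(0.1, 5.0, S).expand(R, S),
    observed_rgb=torch.rand(R, 3),
    pseudo_labels=torch.randint(0, K, (R,)),
)
```

Compositing semantic logits before the softmax-style cross-entropy is one plausible design choice here; other formulations (e.g., compositing per-sample class probabilities) would also fit the self-supervision idea described above.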