3D semantic scene completion (SSC) is an ill-posed task that requires inferring a dense 3D scene from incomplete observations. Previous methods either explicitly incorporate 3D geometric input or rely on learned 3D priors behind monocular RGB images. However, 3D sensors such as LiDAR are expensive and intrusive, while monocular cameras struggle to model precise geometry due to inherent ambiguity. In this work, we propose StereoScene for 3D semantic scene completion, which takes full advantage of lightweight camera inputs without resorting to any external 3D sensors. Our key insight is to leverage stereo matching to resolve geometric ambiguity. To improve robustness in unmatched areas, we introduce a bird's-eye-view (BEV) representation that provides hallucination ability through rich context information. On top of the stereo and BEV representations, a mutual interactive aggregation (MIA) module is carefully devised to fully unleash their power. Specifically, a Bi-directional Interaction Transformer (BIT) augmented with confidence re-weighting encourages reliable prediction through mutual guidance, while a Dual Volume Aggregation (DVA) module facilitates complementary aggregation. Experimental results on SemanticKITTI demonstrate that the proposed StereoScene outperforms state-of-the-art camera-based methods by a large margin, with relative improvements of 26.9% in geometry and 38.6% in semantics.
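To make the mutual interactive aggregation idea concrete, the following is a minimal PyTorch sketch, not the authors' implementation: stereo and BEV feature tokens guide each other via bi-directional cross-attention with confidence re-weighting (the BIT role), and the two interacted volumes are then fused by a learned gate (the DVA role). All module names, shapes, and hyper-parameters here are illustrative assumptions.

import torch
import torch.nn as nn

class BidirectionalInteraction(nn.Module):
    """Hypothetical stand-in for BIT: cross-attends stereo tokens and BEV
    tokens in both directions, after scaling each stream by a learned
    per-token confidence so unreliable tokens contribute less guidance."""
    def __init__(self, dim: int = 64, heads: int = 4):
        super().__init__()
        self.stereo_from_bev = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.bev_from_stereo = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conf_stereo = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
        self.conf_bev = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, stereo_tok, bev_tok):
        # Confidence re-weighting before mutual guidance.
        s = stereo_tok * self.conf_stereo(stereo_tok)
        b = bev_tok * self.conf_bev(bev_tok)
        stereo_out, _ = self.stereo_from_bev(s, b, b)  # BEV guides stereo
        bev_out, _ = self.bev_from_stereo(b, s, s)     # stereo guides BEV
        return stereo_tok + stereo_out, bev_tok + bev_out

class DualVolumeAggregation(nn.Module):
    """Hypothetical stand-in for DVA: fuses the two interacted volumes
    with a learned per-channel gate for complementary aggregation."""
    def __init__(self, dim: int = 64):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, dim), nn.Sigmoid())

    def forward(self, stereo_tok, bev_tok):
        g = self.gate(torch.cat([stereo_tok, bev_tok], dim=-1))
        return g * stereo_tok + (1 - g) * bev_tok

if __name__ == "__main__":
    N, D = 256, 64                 # tokens per volume, channel width (assumed)
    stereo = torch.randn(1, N, D)  # flattened stereo cost-volume features
    bev = torch.randn(1, N, D)     # flattened BEV features
    s, b = BidirectionalInteraction(D)(stereo, bev)
    fused = DualVolumeAggregation(D)(s, b)
    print(fused.shape)             # torch.Size([1, 256, 64])

The gated fusion is one plausible way to realize "complementary aggregation": where stereo matching is confident, its volume dominates, and elsewhere the context-rich BEV volume fills in.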