3D semantic scene completion (SSC) is an ill-posed task that requires inferring a dense 3D scene from incomplete observations. Previous methods either explicitly incorporate 3D geometric input or rely on a learned 3D prior behind monocular RGB images. However, 3D sensors such as LiDAR are expensive and intrusive, while monocular cameras struggle to model precise geometry due to inherent depth ambiguity. In this work, we propose StereoScene for 3D semantic scene completion, which takes full advantage of lightweight camera inputs without resorting to any external 3D sensors. Our key insight is to leverage stereo matching to resolve geometric ambiguity. To improve robustness in unmatched areas, we introduce a bird's-eye-view (BEV) representation that provides hallucination ability backed by rich context information. On top of the stereo and BEV representations, a mutual interactive aggregation (MIA) module is carefully devised to fully unleash their power. Specifically, a Bi-directional Interaction Transformer (BIT) augmented with confidence re-weighting encourages reliable prediction through mutual guidance, while a Dual Volume Aggregation (DVA) module facilitates complementary aggregation of the two volumes. Experimental results on SemanticKITTI demonstrate that the proposed StereoScene outperforms state-of-the-art camera-based methods by a large margin, with relative improvements of 26.9% in geometry and 38.6% in semantics.
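To make the MIA design concrete, below is a minimal PyTorch sketch of the core idea: bidirectional cross-attention between the stereo and BEV feature volumes with confidence re-weighting (the BIT role), followed by a simple fusion of the two refined volumes (the DVA role). All tensor shapes, module names, and the particular confidence formulation are illustrative assumptions for exposition, not the authors' implementation.

```python
# Illustrative sketch of mutual interactive aggregation between a stereo
# feature volume and a BEV feature volume. Shapes and names are assumptions.
import torch
import torch.nn as nn


class BidirectionalInteraction(nn.Module):
    """Cross-attend stereo and BEV features in both directions; a learned
    per-token confidence score down-weights unreliable tokens before they
    guide the other branch (a stand-in for confidence re-weighting)."""

    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.stereo_from_bev = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.bev_from_stereo = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.conf_stereo = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())
        self.conf_bev = nn.Sequential(nn.Linear(dim, 1), nn.Sigmoid())

    def forward(self, stereo: torch.Tensor, bev: torch.Tensor):
        # stereo, bev: (B, N, C) token sequences flattened from 3D volumes.
        s = stereo * self.conf_stereo(stereo)  # re-weighted stereo guidance
        b = bev * self.conf_bev(bev)           # re-weighted BEV guidance
        stereo_out, _ = self.stereo_from_bev(stereo, b, b)  # BEV guides stereo
        bev_out, _ = self.bev_from_stereo(bev, s, s)        # stereo guides BEV
        return stereo + stereo_out, bev + bev_out           # residual updates


class DualVolumeAggregation(nn.Module):
    """Fuse the two refined volumes into a single feature volume."""

    def __init__(self, dim: int):
        super().__init__()
        self.fuse = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(inplace=True))

    def forward(self, stereo: torch.Tensor, bev: torch.Tensor):
        return self.fuse(torch.cat([stereo, bev], dim=-1))


if __name__ == "__main__":
    B, N, C = 2, 1024, 64           # batch, flattened voxel tokens, channels
    stereo = torch.randn(B, N, C)   # stereo-matching feature volume (flattened)
    bev = torch.randn(B, N, C)      # BEV feature volume (flattened)
    mia = BidirectionalInteraction(C)
    dva = DualVolumeAggregation(C)
    s, b = mia(stereo, bev)
    fused = dva(s, b)
    print(fused.shape)              # torch.Size([2, 1024, 64])
```

The design intent this sketch captures is the complementarity the abstract describes: the stereo branch contributes metric geometry where matching succeeds, while the BEV branch supplies context-driven hallucination in unmatched regions, with confidence scores deciding how strongly each branch steers the other before the two volumes are aggregated.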