In this paper, we propose a method for dense semantic 3D scene reconstruction from an RGB-D sequence to support high-level scene understanding tasks. First, each RGB-D pair is consistently segmented into 2D semantic maps using a camera-tracking backbone that propagates high-confidence object labels from full scans to the corresponding regions of partial views. A dense 3D mesh model of the unknown environment is then incrementally generated from the input RGB-D sequence. Building on the consistent 2D semantic segments and the 3D model, a novel semantic projection block (SP-Block) is proposed to extract deep feature volumes from the 2D segments of different views. These semantic volumes are then fused with the deep volumes produced by a point-cloud encoder to yield the final semantic segmentation. Extensive experimental evaluations on public datasets show that our system simultaneously achieves accurate dense 3D reconstruction and state-of-the-art semantic prediction performance.
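To make the projection-and-fusion step concrete, the following is a minimal PyTorch sketch of how a 2D semantic feature map might be back-projected into a shared voxel volume and combined with point-cloud features. This is not the authors' SP-Block; the function names, tensor shapes, count-based averaging, and the concatenation-plus-1x1x1-convolution fusion are all illustrative assumptions.

```python
# Minimal sketch (not the authors' implementation) of the idea behind the
# SP-Block: back-project one view's 2D semantic feature map into a 3D voxel
# volume, then fuse the averaged semantic volume with point-cloud features.
# All names, shapes, and the concat + 1x1x1-conv fusion are assumptions.
import torch

def project_2d_to_volume(feat_2d, depth, K, cam_to_world,
                         vol_origin, voxel_size, vol_dim):
    """Scatter a (C, H, W) feature map into a (C, D, D, D) volume.

    depth:        (H, W) depth in meters, paired with the feature map
    K:            (3, 3) camera intrinsics
    cam_to_world: (4, 4) pose from the camera-tracking backbone
    """
    C, H, W = feat_2d.shape
    # Pixel grid -> camera-space 3D points via the depth map.
    v, u = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    z = depth
    x = (u - K[0, 2]) * z / K[0, 0]
    y = (v - K[1, 2]) * z / K[1, 1]
    pts = torch.stack([x, y, z, torch.ones_like(z)], dim=-1).reshape(-1, 4)
    pts_world = (pts @ cam_to_world.T)[:, :3]

    # World coordinates -> voxel indices; drop points outside the volume.
    idx = ((pts_world - vol_origin) / voxel_size).long()
    valid = ((idx >= 0) & (idx < vol_dim)).all(dim=-1) & (z.reshape(-1) > 0)
    idx, feats = idx[valid], feat_2d.reshape(C, -1).T[valid]

    # Scatter-add features; keep per-voxel counts so hits can be averaged.
    flat = (idx[:, 0] * vol_dim + idx[:, 1]) * vol_dim + idx[:, 2]
    vol = torch.zeros(C, vol_dim ** 3).index_add_(1, flat, feats.T)
    cnt = torch.zeros(vol_dim ** 3).index_add_(0, flat, torch.ones(len(flat)))
    return (vol.reshape(C, vol_dim, vol_dim, vol_dim),
            cnt.reshape(vol_dim, vol_dim, vol_dim))

if __name__ == "__main__":
    C, H, W, D = 8, 60, 80, 32
    sem_vol, cnt = project_2d_to_volume(
        torch.randn(C, H, W),                        # 2D semantic features
        torch.rand(H, W) * 3.0,                      # synthetic depth map
        torch.tensor([[525., 0., 40.], [0., 525., 30.], [0., 0., 1.]]),
        torch.eye(4),                                # identity camera pose
        torch.tensor([-2., -2., 0.]), 0.125, D)
    sem_vol = sem_vol / cnt.clamp(min=1)             # average repeated hits
    pc_vol = torch.randn(C, D, D, D)                 # stand-in point-cloud features
    fuse = torch.nn.Conv3d(2 * C, C, kernel_size=1)  # mix the two volumes
    out = fuse(torch.cat([sem_vol, pc_vol], dim=0).unsqueeze(0))
    print(out.shape)  # torch.Size([1, 8, 32, 32, 32])
```

In the full system, one such volume per view would presumably be accumulated across the sequence before fusion; averaging by hit count is just one simple aggregation choice.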