Semantic scene completion (SSC) is a challenging computer vision task with many practical applications, from robotics to assistive computing. Its goal is to jointly infer the 3D geometry of a scene within the camera's field of view and the semantic labels of its voxels, including occluded regions. In this work, we present SPAwN, a novel lightweight multimodal 3D deep CNN that seamlessly fuses structural data from the depth component of RGB-D images with semantic priors produced by a bimodal 2D segmentation network. A crucial difficulty in this field is the lack of fully labeled real-world 3D datasets large enough to train current data-hungry deep 3D CNNs. In 2D computer vision tasks, many data augmentation strategies have been proposed to improve the generalization ability of CNNs. However, those approaches cannot be directly applied to the RGB-D input and output volume of SSC solutions. In this paper, we introduce a 3D data augmentation strategy that can be applied to multimodal SSC networks. We validate our contributions with a comprehensive and reproducible ablation study. Our solution consistently surpasses previous works with a similar level of complexity.
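The abstract does not specify which geometric transforms the 3D augmentation strategy uses; the key constraint it implies is that every transform must be applied consistently to all input modalities and to the output label volume. The sketch below is a minimal illustration of that idea under assumed transforms (a random flip and a random rotation about the vertical axis); the function name `augment_ssc_sample` and the volume layouts are hypothetical, not part of SPAwN's published interface.

```python
import numpy as np

def augment_ssc_sample(tsdf_vol, prior_vol, label_vol, rng=None):
    """Apply one random 3D augmentation consistently to all volumes.

    tsdf_vol:  (D, H, W)     structural volume derived from the depth image
    prior_vol: (C, D, H, W)  per-class semantic priors lifted from the 2D segmentation
    label_vol: (D, H, W)     ground-truth semantic voxel labels
    (shapes are illustrative assumptions, not the paper's exact layout)
    """
    rng = rng or np.random.default_rng()

    # Random flip along the width axis; the same flip is applied to the
    # inputs and the target so geometry and labels stay voxel-aligned.
    if rng.random() < 0.5:
        tsdf_vol = np.flip(tsdf_vol, axis=-1)
        prior_vol = np.flip(prior_vol, axis=-1)
        label_vol = np.flip(label_vol, axis=-1)

    # Random 90-degree rotation in the ground plane (depth-width axes),
    # again shared across every volume in the sample.
    k = int(rng.integers(0, 4))
    tsdf_vol = np.rot90(tsdf_vol, k, axes=(-3, -1))
    prior_vol = np.rot90(prior_vol, k, axes=(-3, -1))
    label_vol = np.rot90(label_vol, k, axes=(-3, -1))

    return tsdf_vol.copy(), prior_vol.copy(), label_vol.copy()
```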