This paper presents a real-time online vision framework that jointly recovers an indoor scene's 3D structure and semantic labels. Given noisy depth maps, a camera trajectory, and 2D semantic labels at training time, the proposed deep neural network-based approach learns to fuse depth across frames together with suitable semantic labels in scene space. Our approach exploits a joint volumetric representation of depth and semantics in the scene feature space to solve this task. To fuse semantic labels and geometry online in real time, we introduce an efficient vortex pooling block and drop the routing network used in online depth fusion, which preserves high-frequency surface details. We show that the context information provided by the scene semantics helps the depth fusion network learn noise-resistant features. It also helps overcome the shortcomings of current online depth fusion methods in dealing with thin object structures, thickening artifacts, and false surfaces. Experimental evaluation on the Replica dataset shows that our approach can perform depth fusion at 37 and 10 frames per second with an average reconstruction F-score of 88% and 91%, respectively, depending on the depth map resolution. Moreover, our model shows an average IoU score of 0.515 on the ScanNet 3D semantic benchmark leaderboard.
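To make the joint volumetric representation concrete, the following is a minimal sketch of classical weighted TSDF fusion (Curless and Levoy, 1996) extended with a per-voxel semantic score accumulator. This is only an illustration of the data structure the abstract refers to; the paper replaces this hand-crafted running-average update with a learned network, and the class names and shapes here are hypothetical.

```python
import numpy as np

class JointVolume:
    """Illustrative joint depth + semantics volume; not the paper's learned fusion."""

    def __init__(self, dims, num_classes):
        self.tsdf = np.ones(dims, dtype=np.float32)               # truncated signed distances
        self.weight = np.zeros(dims, dtype=np.float32)            # per-voxel fusion weights
        self.sem = np.zeros(dims + (num_classes,), np.float32)    # accumulated class scores

    def integrate(self, idx, tsdf_obs, sem_obs, w_obs=1.0):
        """Fuse one frame's observations at voxel indices `idx`.

        idx      -- tuple of integer index arrays selecting the updated voxels
        tsdf_obs -- observed truncated signed distance per selected voxel
        sem_obs  -- per-voxel class scores (e.g. 2D labels back-projected into the volume)
        """
        w_old = self.weight[idx]
        w_new = w_old + w_obs
        # weighted running average of signed distances
        self.tsdf[idx] = (w_old * self.tsdf[idx] + w_obs * tsdf_obs) / w_new
        # additive accumulation of semantic evidence; argmax gives the per-voxel label
        self.sem[idx] += w_obs * sem_obs
        self.weight[idx] = w_new
```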
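The vortex pooling block referenced above follows the multi-branch design of Xie et al. (2018): each branch average-pools with kernel size k at stride 1, then applies a 3x3 atrous convolution with dilation k, so successive branches aggregate progressively wider context. Below is a minimal 2D PyTorch sketch of that design; the branch rates, channel widths, and the 1x1 projection are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class VortexPooling(nn.Module):
    """Sketch of a vortex pooling block (Xie et al., 2018); parameters are assumed."""

    def __init__(self, in_ch, out_ch, rates=(1, 3, 9, 27)):
        super().__init__()
        self.branches = nn.ModuleList()
        for k in rates:
            layers = []
            if k > 1:
                # stride-1 average pooling keeps the spatial resolution (k is odd)
                layers.append(nn.AvgPool2d(kernel_size=k, stride=1, padding=k // 2))
            layers += [
                # 3x3 atrous convolution with dilation k; padding k preserves H x W
                nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=k, dilation=k, bias=False),
                nn.BatchNorm2d(out_ch),
                nn.ReLU(inplace=True),
            ]
            self.branches.append(nn.Sequential(*layers))
        # 1x1 projection after concatenating all branches
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, kernel_size=1)

    def forward(self, x):
        feats = [branch(x) for branch in self.branches]
        return self.project(torch.cat(feats, dim=1))

# usage: feature maps in, same-resolution context-enriched features out
x = torch.randn(1, 64, 32, 32)
y = VortexPooling(64, 32)(x)   # shape (1, 32, 32, 32)
```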