This paper presents a real-time online vision framework that jointly recovers an indoor scene's 3D structure and semantic labels. Given noisy depth maps, a camera trajectory, and 2D semantic labels at training time, the proposed neural network learns to fuse depth across frames with the corresponding semantic labels in scene space. Our approach exploits a joint volumetric representation of depth and semantics in the scene feature space to solve this task. To achieve compelling online fusion of semantic labels and geometry in real time, we introduce an efficient vortex pooling block and drop the routing network used in online depth fusion, thereby preserving high-frequency surface details. We show that the contextual information provided by scene semantics helps the depth fusion network learn noise-resistant features. It also helps overcome the shortcomings of current online depth fusion methods in handling thin object structures, thickening artifacts, and false surfaces. Experimental evaluation on the Replica dataset shows that our approach performs depth fusion at 37 and 10 frames per second with average reconstruction F-scores of 88% and 91%, respectively, depending on the depth map resolution. Moreover, our model achieves an average IoU score of 0.515 on the ScanNet 3D semantic benchmark leaderboard.