We present an end-to-end 3D reconstruction method for a scene by directly regressing a truncated signed distance function (TSDF) from a set of posed RGB images. Traditional approaches to 3D reconstruction rely on an intermediate representation of depth maps prior to estimating a full 3D model of a scene. We hypothesize that a direct regression to 3D is more effective. A 2D CNN extracts features from each image independently, which are then back-projected and accumulated into a voxel volume using the camera intrinsics and extrinsics. After accumulation, a 3D CNN refines the accumulated features and predicts the TSDF values. Additionally, semantic segmentation of the 3D model is obtained without significant additional computation. This approach is evaluated on the ScanNet dataset, where we significantly outperform state-of-the-art baselines (deep multi-view stereo followed by traditional TSDF fusion) both quantitatively and qualitatively. We compare our 3D semantic segmentation to prior methods that use a depth sensor, since no previous work attempts the problem with only RGB input.
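To make the back-projection step concrete, the sketch below shows one plausible way to lift per-view 2D features into a shared voxel volume using known camera intrinsics and extrinsics, accumulating by averaging over the views that see each voxel. This is a minimal illustration under stated assumptions, not the authors' implementation; all function and variable names (e.g. `backproject_features`) are hypothetical, and the intrinsics are assumed to already be scaled to the feature-map resolution.

```python
# Minimal sketch (assumption, not the paper's code) of back-projecting
# 2D CNN features into a voxel volume and averaging over views.
import torch

def backproject_features(feats, intrinsics, extrinsics, voxel_origin,
                         voxel_size, grid_dims):
    """Accumulate per-view 2D features into a voxel grid.

    feats:        (V, C, H, W) 2D CNN features for V posed views
    intrinsics:   (V, 3, 3) camera intrinsics at the feature resolution
    extrinsics:   (V, 4, 4) world-to-camera transforms
    voxel_origin: (3,) world coordinates of voxel (0, 0, 0)
    voxel_size:   voxel edge length in meters
    grid_dims:    (X, Y, Z) number of voxels per axis
    Returns a (C, X, Y, Z) feature volume averaged over observing views.
    """
    V, C, H, W = feats.shape
    X, Y, Z = grid_dims

    # World coordinates of every voxel center, shape (4, N), N = X*Y*Z.
    xs, ys, zs = torch.meshgrid(
        torch.arange(X), torch.arange(Y), torch.arange(Z), indexing="ij")
    coords = torch.stack([xs, ys, zs], dim=0).reshape(3, -1).float()
    world = coords * voxel_size + voxel_origin.view(3, 1)
    world_h = torch.cat([world, torch.ones(1, world.shape[1])], dim=0)

    volume = torch.zeros(C, world.shape[1])
    counts = torch.zeros(1, world.shape[1])

    for v in range(V):
        # Transform voxel centers into the camera frame and project to pixels.
        cam = (extrinsics[v] @ world_h)[:3]          # (3, N) camera coords
        pix = intrinsics[v] @ cam                    # (3, N) homogeneous pixels
        depth = pix[2].clamp(min=1e-6)
        px = (pix[0] / depth).round().long()         # column index
        py = (pix[1] / depth).round().long()         # row index

        # Keep voxels that project inside the image and lie in front of the camera.
        valid = (cam[2] > 0) & (px >= 0) & (px < W) & (py >= 0) & (py < H)
        volume[:, valid] += feats[v][:, py[valid], px[valid]]
        counts[:, valid] += 1

    # Average features over the number of views observing each voxel.
    volume = volume / counts.clamp(min=1)
    return volume.reshape(C, X, Y, Z)
```

Averaging is one simple choice of accumulation operator; the resulting feature volume would then be passed to the 3D CNN that refines it and regresses the TSDF values.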