Most pipelines for Augmented and Virtual Reality estimate the ego-motion of the camera by creating a map of sparse 3D landmarks. In this paper, we tackle the problem of depth completion, that is, densifying this sparse 3D map using RGB images as guidance. This remains a challenging problem because the 3D landmarks produced by SfM and SLAM pipelines are low-density, non-uniform, and outlier-prone. We introduce a transformer block, SparseFormer, that fuses 3D landmarks with deep visual features to produce dense depth. The SparseFormer has a global receptive field, making the module especially effective for depth completion with low-density and non-uniform landmarks. To address depth outliers among the 3D landmarks, we introduce a trainable refinement module that filters outliers through attention among the sparse landmarks.
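The global receptive field and the attention-based fusion described above can be illustrated with a minimal NumPy sketch. This is not the paper's actual SparseFormer architecture; the shapes, names, and two-stage layout (landmark self-attention for refinement, then pixel-to-landmark cross-attention for fusion) are illustrative assumptions only.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(queries, keys, values):
    # Scaled dot-product attention: every query attends to ALL keys,
    # which is what gives the module a global receptive field.
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d)
    return softmax(scores, axis=-1) @ values

# Hypothetical sizes: 4096 dense pixel feature vectors, 32 sparse
# landmark tokens, embedding dimension 64 (all illustrative).
rng = np.random.default_rng(0)
pixel_feats = rng.standard_normal((4096, 64))
landmark_feats = rng.standard_normal((32, 64))

# Stage 1 (refinement sketch): self-attention among the sparse landmarks,
# where outlier landmarks could be down-weighted by the attention scores.
landmarks_refined = attention(landmark_feats, landmark_feats, landmark_feats)

# Stage 2 (fusion sketch): each dense pixel feature attends to all refined
# landmarks, injecting sparse 3D information into the dense feature map.
fused = attention(pixel_feats, landmarks_refined, landmarks_refined)
print(fused.shape)
```

Because every pixel query attends to every landmark regardless of image distance, the fusion is insensitive to how sparsely or unevenly the landmarks are distributed, unlike a convolutional layer with a fixed local window.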