Real-time monocular 3D reconstruction is a challenging problem that remains unsolved. Although recent end-to-end methods have demonstrated promising results, they hardly capture tiny structures and geometric boundaries, because their supervision neglects spatial details and their oversimplified feature fusion ignores temporal cues. To address these problems, we propose SST, an end-to-end 3D reconstruction network that utilizes Sparse estimated points from a visual SLAM system as additional Spatial guidance and fuses Temporal features via a novel cross-modal attention mechanism, achieving more detailed reconstruction results. We propose a Local Spatial-Temporal Fusion module that exploits more informative spatial-temporal cues from multi-view color information and sparse priors, as well as a Global Spatial-Temporal Fusion module that refines the local TSDF volumes with the world-frame model from coarse to fine. Extensive experiments on ScanNet and 7-Scenes demonstrate that SST outperforms all state-of-the-art competitors while maintaining a high inference speed of 59 FPS, enabling real-world applications with real-time requirements.
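To make the cross-modal fusion idea concrete, the following is a minimal sketch of scaled dot-product cross-attention in which per-voxel color features act as queries and sparse SLAM-point features provide keys and values. All names and shapes here are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

def cross_modal_attention(color_feats, sparse_feats):
    """Hypothetical sketch of cross-modal attention.

    color_feats:  (N, d) features from multi-view color images (queries).
    sparse_feats: (M, d) features lifted from sparse SLAM points (keys/values).
    Returns (N, d) color features enriched with sparse spatial guidance.
    This is plain scaled dot-product attention, not SST's exact mechanism.
    """
    d = color_feats.shape[-1]
    scores = color_feats @ sparse_feats.T / np.sqrt(d)  # (N, M) similarity
    scores -= scores.max(axis=-1, keepdims=True)        # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)      # softmax over sparse points
    return weights @ sparse_feats                       # (N, d) fused features

rng = np.random.default_rng(0)
fused = cross_modal_attention(rng.normal(size=(4, 8)), rng.normal(size=(16, 8)))
print(fused.shape)  # (4, 8)
```

In a real network the queries, keys, and values would pass through learned projections; the sketch omits them to highlight only the attention step that lets color features selectively absorb spatial cues from the sparse points.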