FoundationSLAM: Unleashing the Power of Depth Foundation Models for End-to-End Dense Visual SLAM

We present FoundationSLAM, a learning-based monocular dense SLAM system that addresses the lack of geometric consistency in previous flow-based approaches, enabling accurate and robust tracking and mapping. Our core idea is to bridge flow estimation and geometric reasoning by leveraging guidance from depth foundation models. To this end, we first develop a Hybrid Flow Network that produces geometry-aware correspondences, enabling consistent depth and pose inference across diverse keyframes. To enforce global consistency, we propose a Bi-Consistent Bundle Adjustment Layer that jointly optimizes keyframe poses and depths under multi-view constraints. Furthermore, we introduce a Reliability-Aware Refinement mechanism that dynamically adapts the flow update process by distinguishing between reliable and uncertain regions, forming a closed feedback loop between matching and optimization. Extensive experiments show that FoundationSLAM achieves superior trajectory accuracy and dense reconstruction quality across multiple challenging datasets while running in real time at 18 FPS, demonstrating strong generalization across scenarios and practical applicability.

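To make the link between flow and geometry concrete, the following is a minimal sketch of how a depth value and a relative camera pose induce a pixel correspondence, and how the disagreement between that geometric flow and a predicted flow could be turned into a reliability weight. It assumes a pinhole camera model; the function names `induced_flow` and `reliability`, the Gaussian weighting, and the parameter `sigma` are hypothetical choices for illustration and are not taken from the paper, whose Hybrid Flow Network and Reliability-Aware Refinement are learned components.

```python
import math


def induced_flow(u, v, depth, K, R, t):
    """Flow at pixel (u, v) implied by its depth and a relative pose (R, t).

    K is (fx, fy, cx, cy) for an assumed pinhole camera. R is a 3x3 rotation
    given as nested lists and t a 3-vector, mapping frame i into frame j.
    """
    fx, fy, cx, cy = K
    # Back-project the pixel to a 3D point in frame i.
    X = [(u - cx) / fx * depth, (v - cy) / fy * depth, depth]
    # Rigidly transform the point into frame j: X_j = R X + t.
    Xj = [sum(R[r][c] * X[c] for c in range(3)) + t[r] for r in range(3)]
    # Project into frame j and subtract the source pixel to obtain the flow.
    u2 = fx * Xj[0] / Xj[2] + cx
    v2 = fy * Xj[1] / Xj[2] + cy
    return (u2 - u, v2 - v)


def reliability(pred_flow, geo_flow, sigma=1.0):
    """Illustrative weight in (0, 1]: equals 1 when the two flows agree and
    decays with the squared residual (an assumed Gaussian form)."""
    r2 = (pred_flow[0] - geo_flow[0]) ** 2 + (pred_flow[1] - geo_flow[1]) ** 2
    return math.exp(-r2 / (2.0 * sigma ** 2))


if __name__ == "__main__":
    K = (500.0, 500.0, 320.0, 240.0)
    R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    t = [0.5, 0.0, 0.0]  # pure sideways translation
    geo = induced_flow(320.0, 240.0, 2.0, K, R, t)
    print("geometry-induced flow:", geo)
    print("weight for a matching prediction:", reliability(geo, geo))
    print("weight for a prediction 3 px off:", reliability((geo[0] + 3.0, geo[1]), geo))
```

In a dense bundle adjustment of this style, residuals between predicted and geometry-induced correspondences are what the optimizer minimizes over poses and depths, and per-pixel weights of this kind decide how much each residual counts. The sketch treats a single pixel; a real system would evaluate this over full images in batched tensor form.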