We present DropD-SLAM, a real-time monocular SLAM system that achieves RGB-D-level accuracy without relying on depth sensors. The system replaces active depth input with three pretrained vision modules: a monocular metric depth estimator, a learned keypoint detector, and an instance segmentation network. Dynamic objects are suppressed using dilated instance masks, while static keypoints are assigned predicted depth values and backprojected into 3D to form metrically scaled features. These are processed by an unmodified RGB-D SLAM back end for tracking and mapping. On the TUM RGB-D benchmark, DropD-SLAM attains 7.4 cm mean ATE on static sequences and 1.8 cm on dynamic sequences, matching or surpassing state-of-the-art RGB-D methods while operating at 22 FPS on a single GPU. These results suggest that modern pretrained vision models can replace active depth sensors as reliable, real-time sources of metric scale, marking a step toward simpler and more cost-effective SLAM systems.
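To make the front-end geometry concrete, the sketch below illustrates the two operations the abstract describes: suppressing keypoints that fall on dilated dynamic-object masks, and backprojecting the surviving keypoints to metric 3D using predicted monocular depth. This is an illustrative reconstruction, not code from the paper; the function name, data layout, and dilation radius are assumptions.

```python
import numpy as np
import cv2

def backproject_static_keypoints(keypoints, depth_map, instance_mask, K, dilate_px=15):
    """Drop keypoints on (dilated) dynamic objects and lift the rest to 3D.

    keypoints     : (N, 2) array of (u, v) pixel coordinates
    depth_map     : (H, W) predicted metric depth in meters
    instance_mask : (H, W) binary mask over dynamic-object pixels
    K             : (3, 3) camera intrinsic matrix
    dilate_px     : mask dilation radius in pixels (illustrative value)
    """
    # Dilate instance masks so keypoints near object boundaries are also suppressed.
    kernel = np.ones((dilate_px, dilate_px), np.uint8)
    dilated = cv2.dilate(instance_mask.astype(np.uint8), kernel)

    fx, fy = K[0, 0], K[1, 1]
    cx, cy = K[0, 2], K[1, 2]

    points_3d, kept = [], []
    for i, (u, v) in enumerate(np.round(keypoints).astype(int)):
        if dilated[v, u]:        # keypoint lies on a dilated dynamic-object mask
            continue
        d = depth_map[v, u]      # predicted metric depth at this pixel
        if d <= 0:               # skip invalid depth predictions
            continue
        # Pinhole backprojection: pixel + metric depth -> metrically scaled 3D point.
        points_3d.append([(u - cx) * d / fx, (v - cy) * d / fy, d])
        kept.append(i)
    return np.asarray(points_3d), np.asarray(kept)
```

The resulting 3D features carry metric scale from the depth network, which is what allows an unmodified RGB-D SLAM back end to consume them directly.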