Recent advances in 3D Gaussian Splatting (3DGS) have enabled Simultaneous Localization and Mapping (SLAM) systems to build photorealistic maps. However, these maps lack the open-vocabulary semantic understanding required for advanced robotic interaction. Integrating language features into SLAM remains a significant challenge: storing high-dimensional features incurs excessive memory and rendering overhead, while existing methods that rely on static models cannot adapt to novel environments. To address these limitations, we propose LEGO-SLAM (Language-Embedded Gaussian Optimization SLAM), the first framework to achieve real-time, open-vocabulary mapping within a 3DGS-based SLAM system. At the core of our method is a scene-adaptive encoder-decoder that distills high-dimensional language embeddings into a compact 16-dimensional feature space. This design reduces the per-Gaussian memory footprint and accelerates rendering, enabling real-time performance. Unlike static approaches, our encoder adapts online to unseen scenes. These compact features also enable a language-guided pruning strategy that identifies semantic redundancy, reducing the map's Gaussian count by over 60% while maintaining rendering quality. Furthermore, we introduce a language-based loop detection approach that reuses these mapping features, eliminating the need for a separate detection model. Extensive experiments demonstrate that LEGO-SLAM achieves competitive mapping quality and tracking accuracy while providing open-vocabulary capabilities at 15 FPS.
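To make the compression idea concrete, the following is a minimal sketch of distilling high-dimensional embeddings into 16-dimensional codes and decoding them back. It is not the LEGO-SLAM encoder-decoder: a PCA-style linear projection stands in for the learned, scene-adaptive network, the 512-dimensional input size and float32 storage are assumptions (the abstract gives neither), the data are synthetic, and all function names are hypothetical.

```python
import numpy as np

EMBED_DIM = 512   # ASSUMED language-embedding size; not stated in the abstract
COMPACT_DIM = 16  # compact feature size stated in the abstract


def fit_linear_codec(embeddings, compact_dim=COMPACT_DIM):
    """Fit a PCA-style linear encoder/decoder (stand-in for a learned network)."""
    mean = embeddings.mean(axis=0)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(embeddings - mean, full_matrices=False)
    basis = vt[:compact_dim]  # shape: (compact_dim, embed_dim)
    return mean, basis


def encode(x, mean, basis):
    """Project full embeddings onto the compact feature space."""
    return (x - mean) @ basis.T


def decode(z, mean, basis):
    """Map compact features back to the full embedding space."""
    return z @ basis + mean


def bytes_per_gaussian(dim, bytes_per_value=4):
    """Feature storage per Gaussian, assuming float32 values."""
    return dim * bytes_per_value


rng = np.random.default_rng(0)
# Synthetic embeddings lying close to a 16-dimensional subspace.
latent = rng.normal(size=(1000, COMPACT_DIM))
mixing = rng.normal(size=(COMPACT_DIM, EMBED_DIM))
embeddings = latent @ mixing + 0.01 * rng.normal(size=(1000, EMBED_DIM))

mean, basis = fit_linear_codec(embeddings)
codes = encode(embeddings, mean, basis)
recon = decode(codes, mean, basis)
rel_error = np.linalg.norm(recon - embeddings) / np.linalg.norm(embeddings)

print("compact feature shape:", codes.shape)
print("relative reconstruction error:", rel_error)
print("bytes per Gaussian (full vs compact):",
      bytes_per_gaussian(EMBED_DIM), "vs", bytes_per_gaussian(COMPACT_DIM))
```

Under these assumptions, each Gaussian stores 64 bytes of language features instead of 2048, a 32-fold reduction. A fixed projection like this one cannot adapt to new scenes, which is the limitation the online-adapting encoder described above is meant to address.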