We introduce LocoMamba, a vision-driven cross-modal deep reinforcement learning framework built on selective state-space models, specifically Mamba, which achieves near-linear-time sequence modeling, effectively captures long-range dependencies, and enables efficient training on longer sequences. First, we embed proprioceptive states with a multilayer perceptron and patchify depth images with a lightweight convolutional neural network, producing compact tokens that improve state representation. Second, stacked Mamba layers fuse these tokens via near-linear-time selective scanning, reducing latency and memory footprint, remaining robust to token length and image resolution, and providing an inductive bias that mitigates overfitting. Third, we train the policy end-to-end with Proximal Policy Optimization under terrain and appearance randomization and an obstacle-density curriculum, using a compact state-centric reward that balances progress, smoothness, and safety. We evaluate our method in challenging simulated environments with static and moving obstacles as well as uneven terrain. Compared with state-of-the-art baselines, our method achieves higher returns and success rates with fewer collisions, generalizes more strongly to unseen terrains and obstacle densities, and improves training efficiency by converging in fewer updates under the same compute budget.
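To make the cross-modal tokenization and Mamba-based fusion described above concrete, the following is a minimal sketch, not the authors' implementation: it assumes a PyTorch setting and the `mamba_ssm` package for the selective state-space layer, and the embedding dimension, patch size, layer count, and mean pooling are illustrative assumptions rather than values from the paper.

```python
# Minimal sketch (not the authors' code) of cross-modal tokenization and Mamba fusion.
# Layer sizes, patch size, and the use of the `mamba_ssm` package are assumptions.
import torch
import torch.nn as nn
from mamba_ssm import Mamba  # assumed selective state-space layer: (B, L, D) -> (B, L, D)


class LocoMambaEncoder(nn.Module):
    def __init__(self, proprio_dim=48, d_model=128, patch=16, n_layers=4):
        super().__init__()
        # Proprioceptive state -> a single token via an MLP.
        self.proprio_mlp = nn.Sequential(
            nn.Linear(proprio_dim, d_model), nn.GELU(), nn.Linear(d_model, d_model)
        )
        # Depth image -> patch tokens via a lightweight CNN (stride = patch size).
        self.depth_patchify = nn.Conv2d(1, d_model, kernel_size=patch, stride=patch)
        # Stacked Mamba layers fuse the token sequence with near-linear-time selective scanning.
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.LayerNorm(d_model), Mamba(d_model)) for _ in range(n_layers)]
        )
        self.norm = nn.LayerNorm(d_model)

    def forward(self, proprio, depth):
        # proprio: (B, proprio_dim); depth: (B, 1, H, W)
        p_tok = self.proprio_mlp(proprio).unsqueeze(1)                  # (B, 1, D)
        d_tok = self.depth_patchify(depth).flatten(2).transpose(1, 2)   # (B, N, D)
        x = torch.cat([p_tok, d_tok], dim=1)                            # (B, 1 + N, D)
        for blk in self.blocks:
            x = x + blk(x)                                              # residual Mamba block
        return self.norm(x).mean(dim=1)                                 # pooled feature for the policy head
```

In this sketch the pooled feature would feed a PPO actor-critic head; the hypothetical `LocoMambaEncoder` name and the residual/pooling choices are ours, used only to illustrate how compact proprioceptive and depth tokens can be fused by stacked Mamba layers.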