Autonomous vehicles (AVs) offer a cost-effective solution for scientific missions such as underwater tracking. Recently, reinforcement learning (RL) has emerged as a powerful method for controlling AVs in complex marine environments. However, scaling these techniques to a fleet--essential for tracking multiple targets, or targets with fast, unpredictable motion--presents significant computational challenges. Multi-Agent Reinforcement Learning (MARL) is notoriously sample-inefficient, and while high-fidelity simulators such as Gazebo's LRAUV run single-robot simulations 100x faster than real time, they offer no comparable speedup in multi-vehicle scenarios, making direct MARL training impractical. To address these limitations, we propose an iterative distillation method that transfers the high-fidelity simulation into a simplified, GPU-accelerated environment while preserving its high-level dynamics. This approach achieves up to a 30,000x speedup over Gazebo through parallelization, enabling efficient end-to-end training on GPU. In addition, we introduce TransfMAPPO, a novel Transformer-based architecture that learns multi-agent policies invariant to the number of agents and targets, significantly improving sample efficiency. After large-scale curriculum learning conducted entirely on GPU, we perform extensive evaluations in Gazebo, showing that our method keeps tracking errors below 5 meters over extended durations, even in the presence of multiple fast-moving targets. This work bridges the gap between large-scale MARL training and high-fidelity deployment, providing a scalable framework for autonomous fleet control in real-world sea missions.
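To make the invariance claim concrete, the following is a minimal sketch, not the paper's implementation, of an agent- and target-count-invariant actor in the spirit of TransfMAPPO. All module names, feature sizes, and the token layout are assumptions for illustration: each agent and each target becomes one token, a positional-encoding-free TransformerEncoder (hence permutation-equivariant) mixes them, and actions are decoded from the agent tokens only.

```python
import torch
import torch.nn as nn

class SetInvariantActor(nn.Module):
    """Hypothetical actor whose parameter count is independent of the
    number of agents and targets; only the token set changes."""
    def __init__(self, agent_obs_dim: int, target_obs_dim: int,
                 action_dim: int, d_model: int = 128, n_heads: int = 4,
                 n_layers: int = 2):
        super().__init__()
        # Separate embeddings so agent and target tokens share d_model.
        self.embed_agent = nn.Linear(agent_obs_dim, d_model)
        self.embed_target = nn.Linear(target_obs_dim, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.action_head = nn.Linear(d_model, action_dim)

    def forward(self, agent_obs: torch.Tensor, target_obs: torch.Tensor):
        # agent_obs:  (batch, n_agents, agent_obs_dim)   -- n_agents varies
        # target_obs: (batch, n_targets, target_obs_dim) -- n_targets varies
        tokens = torch.cat(
            [self.embed_agent(agent_obs), self.embed_target(target_obs)],
            dim=1)
        mixed = self.encoder(tokens)  # no positional encoding
        n_agents = agent_obs.shape[1]
        # One action vector per agent token; extra targets only change the
        # attention context, not the parameter count.
        return self.action_head(mixed[:, :n_agents])

# The same weights run unchanged for any fleet/target size:
actor = SetInvariantActor(agent_obs_dim=8, target_obs_dim=4, action_dim=2)
for n_agents, n_targets in [(2, 1), (5, 3)]:
    acts = actor(torch.randn(1, n_agents, 8), torch.randn(1, n_targets, 4))
    assert acts.shape == (1, n_agents, 2)
```

Because the encoder uses no positional encoding and the action head is shared across agent tokens, a single set of weights applies to any fleet size, which is the property that allows one policy to be trained under a curriculum over varying numbers of agents and targets.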