Multi-robot systems can benefit from reinforcement learning (RL) algorithms that learn behaviours in a small number of trials, a property known as sample efficiency. This research thus investigates the use of learned world models to improve sample efficiency. We present a novel multi-agent model-based RL algorithm, Multi-Agent Model-Based Policy Optimization (MAMBPO), utilizing the Centralized Learning for Decentralized Execution (CLDE) framework. CLDE algorithms allow a group of agents to act in a fully decentralized manner after training, a desirable property for many systems comprising multiple robots. MAMBPO uses a learned world model to improve sample efficiency compared to model-free Multi-Agent Soft Actor-Critic (MASAC). We demonstrate this on two simulated multi-robot tasks, where MAMBPO achieves performance similar to MASAC while requiring far fewer samples to do so. Through this, we take an important step towards making real-life learning for multi-robot systems possible.
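To make the core idea concrete, the sketch below shows a minimal MBPO-style training loop adapted to multiple agents, as the abstract describes: a learned world model is fit to real transitions, short imagined rollouts are branched from real states, and each agent's soft actor-critic is updated on the mixed data. This is a hedged illustration only; the interfaces `env`, `agents`, and `world_model`, all hyperparameter values, and every function name here are assumptions for exposition, not the paper's actual implementation.

```python
import random
from collections import deque

def mambpo_sketch(env, agents, world_model, epochs=100,
                  rollout_length=1, model_rollouts=400, batch_size=256):
    """Hypothetical MBPO-style loop for multiple agents (illustrative only)."""
    real_buffer = deque(maxlen=100_000)   # transitions from the real environment
    model_buffer = deque(maxlen=100_000)  # imagined transitions from the world model

    for _ in range(epochs):
        # 1. Collect real experience with the current decentralized policies.
        obs = env.reset()
        done = False
        while not done:
            actions = [a.act(o) for a, o in zip(agents, obs)]
            next_obs, rewards, done = env.step(actions)
            real_buffer.append((obs, actions, rewards, next_obs, done))
            obs = next_obs

        # 2. Fit the learned world model to all real transitions seen so far.
        world_model.train(list(real_buffer))

        # 3. Branch short imagined rollouts from randomly chosen real states.
        for _ in range(model_rollouts):
            obs, *_ = random.choice(real_buffer)
            for _ in range(rollout_length):
                actions = [a.act(o) for a, o in zip(agents, obs)]
                next_obs, rewards, done = world_model.predict(obs, actions)
                model_buffer.append((obs, actions, rewards, next_obs, done))
                if done:
                    break
                obs = next_obs

        # 4. Centralized training: update each agent's soft actor-critic on a
        #    batch mixing real and imagined data.
        combined = list(real_buffer) + list(model_buffer)
        batch = random.sample(combined, k=min(batch_size, len(combined)))
        for agent in agents:
            agent.update(batch)
```

Because the world model supplies additional synthetic transitions, the agents need far fewer real environment interactions, which is the sample-efficiency gain the abstract claims over model-free MASAC; after training, each agent acts from its own observations only, preserving decentralized execution.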