Efficient exploration is important for reinforcement learning (RL) agents to achieve high rewards. In multi-agent systems, coordinated exploration and behaviour are critical for the agents to jointly achieve optimal outcomes. In this paper, we introduce a new general framework for improving the coordination and performance of multi-agent reinforcement learning (MARL) agents. Our framework, named the Learnable Intrinsic-Reward Generation Selection algorithm (LIGS), introduces an adaptive learner, the Generator, that observes the agents and learns to construct intrinsic rewards online that coordinate the agents' joint exploration and joint behaviour. Using a novel combination of RL and switching controls, LIGS determines the states at which to add intrinsic rewards, which leads to a highly efficient learning process. LIGS can subdivide complex tasks, making them easier to solve, and enables systems of RL agents to quickly solve environments with sparse rewards. LIGS can seamlessly adopt existing MARL algorithms, and our theory shows that it ensures convergence to joint policies that deliver higher system performance. We demonstrate the superior performance of the LIGS framework in challenging tasks in Foraging and StarCraft II.
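For intuition only, the sketch below illustrates the reward-augmentation idea the abstract describes: a learned Generator decides, via a switching control, whether to add an intrinsic bonus to the agents' extrinsic rewards at the current joint state. All names (`generator.switch`, `generator.intrinsic_reward`) are hypothetical placeholders, not the paper's actual interface.

```python
from typing import List, Protocol


class Generator(Protocol):
    """Hypothetical interface for the adaptive intrinsic-reward learner."""

    def switch(self, joint_state) -> bool:
        """Switching control: decide whether to intervene at this state."""
        ...

    def intrinsic_reward(self, joint_state) -> float:
        """Construct an intrinsic bonus for the current joint state."""
        ...


def shaped_rewards(env_rewards: List[float], joint_state, generator: Generator) -> List[float]:
    """Augment each agent's extrinsic reward with a learned intrinsic bonus.

    The Generator first decides (switching control) whether this is a state
    worth shaping; only then is an intrinsic reward computed and added.
    """
    if generator.switch(joint_state):
        bonus = generator.intrinsic_reward(joint_state)
        return [r + bonus for r in env_rewards]
    return env_rewards
```

In this reading, each agent's underlying MARL algorithm is unchanged; only the reward signal it trains on is augmented, which is why the framework can wrap existing methods.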