Efficient exploration in deep cooperative multi-agent reinforcement learning (MARL) remains challenging in complex coordination problems. In this paper, we introduce a novel method, Episodic Multi-agent reinforcement learning with Curiosity-driven exploration (EMC). We leverage an insight from popular factorized MARL algorithms: the "induced" individual Q-values, i.e., the individual utility functions used for local execution, are embeddings of local action-observation histories and, owing to reward backpropagation during centralized training, can capture the interactions between agents. Therefore, we use prediction errors of individual Q-values as intrinsic rewards for coordinated exploration, and employ episodic memory to exploit informative explored experience to boost policy training. Because the dynamics of an agent's individual Q-value function captures both the novelty of states and the influence of other agents, our intrinsic reward induces coordinated exploration toward new or promising states. We illustrate the advantages of our method with didactic examples and demonstrate that it significantly outperforms state-of-the-art MARL baselines on challenging tasks in the StarCraft II micromanagement benchmark.
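To make the intrinsic-reward idea concrete, the sketch below (not the authors' code) shows one plausible instantiation under our own assumptions: a predictor network tries to match an agent's individual Q-values, and the per-agent prediction error serves as an exploration bonus combined with the extrinsic reward via an illustrative weight `beta`. Network shapes and names (`q_net`, `predictor`, `intrinsic_reward`) are hypothetical.

```python
# Minimal sketch of a curiosity bonus from prediction errors of individual Q-values.
# This is an assumption-laden illustration, not the paper's implementation.
import torch
import torch.nn as nn

n_agents, obs_dim, n_actions, beta = 3, 16, 5, 0.1  # illustrative sizes and weight

# Individual utility network Q_i(tau_i, .) and a predictor trying to track it.
q_net = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))
predictor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, n_actions))

def intrinsic_reward(obs):
    """obs: (n_agents, obs_dim) local observation/history embeddings."""
    with torch.no_grad():
        q_values = q_net(obs)       # "induced" individual Q-values
    pred = predictor(obs)           # predictor's estimate of those Q-values
    # Per-agent L2 prediction error, averaged over agents -> exploration bonus.
    error = (pred - q_values).pow(2).sum(dim=-1).sqrt()
    return error.mean()

obs = torch.randn(n_agents, obs_dim)
r_int = intrinsic_reward(obs)
# During training, the extrinsic and intrinsic terms would be combined, e.g.:
# r_total = r_ext + beta * r_int.detach()
```

In this sketch the predictor is trained to regress the current individual Q-values, so its error stays large for states whose Q-values are novel or still changing because of other agents' learning, which is the property the abstract attributes to the intrinsic reward.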