An efficient exploration strategy is one of the essential issues in cooperative multi-agent reinforcement learning (MARL) for tasks requiring complex coordination. In this study, we introduce a new exploration method based on strangeness that can easily be incorporated into any MARL algorithm following the centralized training and decentralized execution (CTDE) paradigm. Strangeness refers to the degree of unfamiliarity of the observations an agent visits. To give the observation strangeness a global perspective, it is also augmented with the degree of unfamiliarity of the visited entire state. The exploration bonus is derived from the strangeness, and the proposed exploration method is not much affected by the stochastic transitions commonly observed in MARL tasks. To prevent a high exploration bonus from making MARL training insensitive to extrinsic rewards, we also propose a separate action-value function trained on both the extrinsic reward and the exploration bonus, on which the behavioral policy that generates transitions is based. This makes CTDE-based MARL algorithms more stable when they are combined with an exploration method. Through a comparative evaluation on didactic examples and the StarCraft Multi-Agent Challenge, we show that the proposed exploration method achieves significant performance improvements in CTDE-based MARL algorithms.
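A minimal sketch of the idea described above, assuming a reconstruction-error model as the measure of unfamiliarity; the names (StrangenessModel, exploration_bonus, behavioral_td_target), the autoencoder choice, and the weighting parameter beta are illustrative assumptions, not the paper's implementation.

# Illustrative sketch only: a reconstruction-error proxy for "strangeness" and a
# TD target for a separate behavioral action-value function trained on the
# extrinsic reward plus the exploration bonus.
import torch
import torch.nn as nn


class StrangenessModel(nn.Module):
    """Scores how unfamiliar an input (agent observation or global state) is."""

    def __init__(self, in_dim, hidden_dim=64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(in_dim, hidden_dim), nn.ReLU())
        self.decoder = nn.Linear(hidden_dim, in_dim)

    def forward(self, x):
        recon = self.decoder(self.encoder(x))
        # Per-sample reconstruction error acts as the strangeness score:
        # rarely visited inputs are reconstructed poorly and score higher.
        return ((recon - x) ** 2).mean(dim=-1)


def exploration_bonus(obs_strangeness, state_strangeness, beta=0.5):
    # Augment per-agent observation strangeness with global state strangeness
    # to give the bonus a global perspective (beta is an assumed weight).
    return obs_strangeness + beta * state_strangeness


def behavioral_td_target(r_ext, bonus, q_next_max, gamma=0.99):
    # TD target for the separate action-value function that drives the
    # behavioral policy: it sees extrinsic reward plus the exploration bonus,
    # so the main MARL objective can stay sensitive to extrinsic rewards.
    return r_ext + bonus + gamma * q_next_max

In such a sketch, the strangeness models for observations and for the global state would be fitted on visited data (e.g., by minimizing the same reconstruction error), so the bonus naturally decays as regions become familiar.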