Centralised training (CT) is the basis for many popular multi-agent reinforcement learning (MARL) methods because it allows agents to quickly learn high-performing policies. However, CT relies on agents learning from one-off observations of other agents' actions at a given state. Because MARL agents explore and update their policies during training, these observations often provide poor predictions of other agents' behaviour and of the expected return for a given action. CT methods therefore suffer from high-variance, error-prone value estimates, which harm learning. CT methods also suffer from explosive growth in complexity due to their reliance on global observations, unless strong factorisation restrictions are imposed (e.g., the monotonic value factorisation of QMIX). We address these challenges with a new semi-centralised MARL framework that performs policy-embedded training and decentralised execution. Our method, the Policy Embedded Reinforcement Learning Algorithm (PERLA), is an enhancement tool for actor-critic MARL algorithms that leverages a novel parameter sharing protocol and policy embedding method to maintain value estimates that account for other agents' behaviour. Our theory proves that PERLA dramatically reduces the variance in value estimates. Unlike various CT methods, PERLA, which integrates seamlessly with existing MARL algorithms, scales easily with the number of agents without requiring restrictive factorisation assumptions. We demonstrate PERLA's superior empirical performance and efficient scaling in benchmark environments including StarCraft Micromanagement II and Multi-Agent MuJoCo.
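To make the variance-reduction claim concrete, the following is a minimal, self-contained sketch of the kind of Monte Carlo marginalisation over teammates' policies that the policy-embedding idea suggests. It is illustrative only and not the authors' implementation: the toy joint value function `q_joint`, the teammate policies `pi_j`, `pi_k`, and the sample count are assumptions introduced for this example. It contrasts a "one-off" target built from a single observed teammate action (the CT-style estimate criticised above) with an estimate that averages over actions sampled from teammates' current policies.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setting: agent i fixes its action; two teammates act stochastically.
# q_joint is a stand-in joint action-value function (assumed for illustration).
def q_joint(a_i, a_j, a_k):
    return 1.0 * a_i + 0.5 * a_j - 0.3 * a_k + 0.2 * a_j * a_k

n_actions = 4
# Teammates' current stochastic policies (categorical distributions).
pi_j = rng.dirichlet(np.ones(n_actions))
pi_k = rng.dirichlet(np.ones(n_actions))

def one_off_estimate(a_i):
    """CT-style target: a single observed sample of the teammates' actions."""
    a_j = rng.choice(n_actions, p=pi_j)
    a_k = rng.choice(n_actions, p=pi_k)
    return q_joint(a_i, a_j, a_k)

def policy_embedded_estimate(a_i, n_samples=32):
    """Marginalise over teammates' policies by Monte Carlo sampling
    (illustrative analogue of conditioning the critic on teammate policies)."""
    a_j = rng.choice(n_actions, p=pi_j, size=n_samples)
    a_k = rng.choice(n_actions, p=pi_k, size=n_samples)
    return q_joint(a_i, a_j, a_k).mean()

a_i = 2
single = np.array([one_off_estimate(a_i) for _ in range(1000)])
embedded = np.array([policy_embedded_estimate(a_i) for _ in range(1000)])
print(f"one-off estimator : mean={single.mean():.3f}, var={single.var():.4f}")
print(f"policy-embedded   : mean={embedded.mean():.3f}, var={embedded.var():.4f}")
```

Both estimators are unbiased for the expected return of agent i's action under the teammates' current policies, but the policy-embedded average has markedly lower variance, mirroring the effect the theory above attributes to PERLA.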