Policy gradient methods have become popular in multi-agent reinforcement learning, but they suffer from high variance due to environmental stochasticity and exploring agents (i.e., non-stationarity), which is potentially worsened by the difficulty of credit assignment. As a result, there is a need for a method that not only efficiently addresses these two problems but is also robust enough to solve a variety of tasks. To this end, we propose a new multi-agent policy gradient method, called Robust Local Advantage (ROLA) Actor-Critic. ROLA allows each agent to learn an individual action-value function as a local critic, while ameliorating environment non-stationarity via a novel centralized training approach based on a centralized critic. Using this local critic, each agent computes a baseline to reduce the variance of its policy gradient estimates, yielding an expected advantage action-value over other agents' choices that implicitly improves credit assignment. We evaluate ROLA across diverse benchmarks and show its robustness and effectiveness compared with a number of state-of-the-art multi-agent policy gradient algorithms.
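To make the baseline idea concrete, below is a minimal, illustrative sketch of a local-advantage policy-gradient update for a single agent. It assumes discrete actions, a softmax policy, and a local critic exposed as a callable `q_local(obs, a_i, a_minus_i)`; all names and the tabular toy critic are assumptions for illustration, not the paper's implementation.

```python
# Illustrative sketch only: a local critic provides a counterfactual baseline
# (marginalizing the agent's own policy while fixing other agents' sampled
# actions), and the resulting advantage scales a REINFORCE-style update.
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    p = np.exp(z)
    return p / p.sum()

def local_advantage(q_local, obs, own_action, other_actions, policy_logits, n_actions):
    """Advantage of the chosen action over the agent's own policy-weighted baseline."""
    pi = softmax(policy_logits)                       # current policy over own actions
    q_chosen = q_local(obs, own_action, other_actions)
    baseline = sum(pi[a] * q_local(obs, a, other_actions) for a in range(n_actions))
    return q_chosen - baseline

def policy_gradient_step(logits, own_action, advantage, lr=1e-2):
    """Gradient-ascent step on logits: grad log pi(a_i | obs) scaled by the advantage."""
    pi = softmax(logits)
    grad_log_pi = -pi
    grad_log_pi[own_action] += 1.0                    # d/d logits of log softmax at a_i
    return logits + lr * advantage * grad_log_pi

# Toy usage: 3 own actions, one other agent, random (hypothetical) local critic values.
rng = np.random.default_rng(0)
q_table = rng.normal(size=(3, 3))                     # q_table[a_i, a_other]
q_local = lambda obs, a_i, a_other: q_table[a_i, a_other]
logits = np.zeros(3)
adv = local_advantage(q_local, obs=None, own_action=1, other_actions=2,
                      policy_logits=logits, n_actions=3)
logits = policy_gradient_step(logits, own_action=1, advantage=adv)
```

The key design point this sketch highlights is that the baseline is computed from the agent's own local critic and own policy, so subtracting it reduces gradient variance without biasing the expected update.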