Multi-agent Natural Actor-critic Reinforcement Learning Algorithms

From arXiv: A very high-level summary of our revision is: In Section 3.5, we theoretically prove that the objective function value from the deterministic variant of the MAN algorithms dominates that of the MAAC algorithm under some minimal conditions. It relies on Lemma 2 of our paper: the minimum singular value of the Fisher information matrix is well within the reciprocal of the policy parameter dimension.

Multi-agent actor-critic algorithms are an important part of the reinforcement learning paradigm. We propose three fully decentralized multi-agent natural actor-critic (MAN) algorithms in this work. The objective is to collectively find a joint policy that maximizes the average long-term return of these agents. In the absence of a central controller, and to preserve privacy, agents communicate some information to their neighbors via a time-varying communication network. We prove convergence of all three MAN algorithms to a globally asymptotically stable set of the ODE corresponding to the actor update; the algorithms use linear function approximation. We show that the Kullback-Leibler divergence between policies of successive iterates is proportional to the objective function's gradient. We observe that the minimum singular value of the Fisher information matrix is well within the reciprocal of the policy parameter dimension. Using this, we theoretically show that the optimal value of the deterministic variant of the MAN algorithm at each iterate dominates that of the standard gradient-based multi-agent actor-critic (MAAC) algorithm. To our knowledge, it is the first such result in multi-agent reinforcement learning (MARL). To illustrate the usefulness of our proposed algorithms, we implement them on a bi-lane traffic network to reduce the average network congestion. We observe an almost 25% reduction in the average congestion with two of the MAN algorithms; the average congestion with the third MAN algorithm is on par with the MAAC algorithm. We also consider a generic 15-agent MARL problem; the performance of the MAN algorithms is again as good as that of the MAAC algorithm.
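To make the role of the Fisher information matrix more concrete, here is a minimal single-agent sketch in Python. It is not the paper's decentralized MAN algorithms: there is no critic, no communication network, and no consensus step. It assumes a toy one-step problem with a softmax policy over linear features, and all function names, the feature matrix, and the rewards are hypothetical. The sketch computes the Fisher matrix exactly, preconditions the ordinary policy gradient with it to get a natural gradient, and reads off the minimum singular value. It also includes a KL helper, because for a small parameter step `delta` the standard second-order approximation is KL(pi_theta || pi_{theta+delta}) ≈ 0.5 * delta^T F(theta) delta.

```python
import numpy as np

def softmax_policy(theta, features):
    """pi(a) proportional to exp(theta . phi(a)); features has shape (num_actions, dim)."""
    logits = features @ theta
    logits = logits - logits.max()  # subtract the max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()

def fisher_matrix(theta, features):
    """F = E_pi[score score^T], summed exactly over the finite action set."""
    probs = softmax_policy(theta, features)
    score = features - probs @ features  # grad log pi(a) = phi(a) - E_pi[phi]
    return (score * probs[:, None]).T @ score

def vanilla_and_natural_gradient(theta, features, rewards, damping=1e-8):
    """Gradient of J(theta) = sum_a pi(a) r(a) and its Fisher-preconditioned version."""
    probs = softmax_policy(theta, features)
    score = features - probs @ features
    grad = score.T @ (probs * rewards)
    fisher = fisher_matrix(theta, features)
    # A small damping term (an assumption of this sketch) keeps the solve well-posed.
    nat_grad = np.linalg.solve(fisher + damping * np.eye(len(theta)), grad)
    return grad, nat_grad, fisher

def kl_divergence(p, q):
    """KL(p || q) for two strictly positive probability vectors."""
    return float(np.sum(p * np.log(p / q)))

# Hypothetical toy data: 5 actions, 3 policy parameters.
rng = np.random.default_rng(0)
features = rng.normal(size=(5, 3))
rewards = rng.normal(size=5)
theta = np.zeros(3)

grad, nat_grad, fisher = vanilla_and_natural_gradient(theta, features, rewards)
min_singular_value = np.linalg.svd(fisher, compute_uv=False).min()
print("vanilla gradient:", grad)
print("natural gradient:", nat_grad)
print("min singular value of F:", min_singular_value, "vs 1/dim:", 1 / len(theta))
```

The natural gradient rescales the update by the inverse Fisher matrix, so directions in which the policy distribution changes quickly get smaller steps. Whether the minimum singular value falls below the reciprocal of the parameter dimension in this toy setup depends on the random features; the sketch only shows how one might check it numerically.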
