Existing distributed cooperative multi-agent reinforcement learning (MARL) frameworks usually assume undirected coordination and communication graphs and estimate a global reward via consensus algorithms for policy evaluation. Such frameworks may incur expensive communication costs and exhibit poor scalability due to the requirement of global consensus. In this work, we study MARL with directed coordination graphs and propose a distributed RL algorithm in which local policy evaluation is based on local value functions. The local value function of each agent is obtained through local communication with its neighbors over a directed, learning-induced communication graph, without using any consensus algorithm. A zeroth-order optimization (ZOO) approach based on parameter perturbation is employed for gradient estimation. By comparison with existing ZOO-based RL algorithms, we show that the proposed distributed RL algorithm guarantees high scalability. A distributed resource allocation example illustrates the effectiveness of the algorithm.
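To make the parameter-perturbation idea concrete, the following is a minimal sketch of a standard two-point zeroth-order gradient estimator of the kind the abstract refers to. It is not the paper's actual implementation: the callable `local_return` is a hypothetical stand-in for an agent's locally estimated return (the local value function obtained from directed neighbor communication), and the perturbation scale, sample count, and function names are illustrative assumptions.

```python
import numpy as np

def zoo_gradient_estimate(theta, local_return, delta=0.05, num_samples=20, rng=None):
    """Two-point zeroth-order gradient estimate via random parameter perturbation.

    theta        : local policy parameters of one agent (1-D array).
    local_return : callable mapping perturbed parameters to a scalar return,
                   assumed to come from the agent's local value function
                   (hypothetical stand-in for the local evaluation step).
    delta        : perturbation radius (illustrative choice).
    """
    rng = np.random.default_rng() if rng is None else rng
    d = theta.size
    grad = np.zeros(d)
    for _ in range(num_samples):
        u = rng.standard_normal(d)
        u /= np.linalg.norm(u)            # random direction on the unit sphere
        j_plus = local_return(theta + delta * u)
        j_minus = local_return(theta - delta * u)
        # Two-point estimator: (d / 2*delta) * (J(theta+delta*u) - J(theta-delta*u)) * u
        grad += (d / (2.0 * delta)) * (j_plus - j_minus) * u
    return grad / num_samples
```

Each agent would apply such an estimate to its own parameters only, using returns computed from locally available information, which is what avoids the global consensus step criticized above.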