Reward-Reinforced Reinforcement Learning for Multi-agent Systems

Reinforcement learning algorithms in multi-agent systems deliver highly resilient and adaptable solutions for common problems in telecommunications, aerospace, and industrial robotics. However, achieving an optimal global goal remains a persistent obstacle for collaborative multi-agent systems, where learning affects the behaviour of more than one agent. A number of nonlinear function approximation methods have been proposed for solving the Bellman equation, which describes a recursive form of an optimal policy. However, how to leverage the value distribution in reinforcement learning, and how to improve the efficiency and efficacy of such systems, remain open challenges. In this work, we developed a reward-reinforced generative adversarial network to represent the distribution of the value function, replacing the approximation of Bellman updates. We demonstrated that our method is resilient and outperforms conventional reinforcement learning methods. The method is also applied to a practical case study: maximising the number of user connections to autonomous airborne base stations in a mobile communication network. Our method maximises the data likelihood using a cost function under which agents have optimal learned behaviours. This reward-reinforced generative adversarial network can be used as a generic framework for multi-agent learning at the system level.
