Multi-agent reinforcement learning has been successfully applied to a number of challenging problems. Despite these empirical successes, theoretical understanding of the underlying algorithms remains limited, primarily due to the curse of dimensionality caused by the exponential growth of the state-action space with the number of agents. We study the fundamental problem of the multi-agent linear quadratic regulator (LQR) in a setting where the agents are partially exchangeable. In this setting, we develop a hierarchical actor-critic algorithm whose computational complexity is independent of the total number of agents, and prove its global linear convergence to the optimal policy. As LQRs are often used to approximate general dynamical systems, this paper provides an important step towards a better understanding of general hierarchical mean-field multi-agent reinforcement learning.
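For readers less familiar with the setting, the following is a minimal background sketch of the standard single-agent, discrete-time LQR problem on which the multi-agent formulation builds; the partially exchangeable multi-agent model and the hierarchical actor-critic updates are those defined in the body of the paper, and the symbols $A$, $B$, $Q$, $R$, $K$ below denote the usual single-agent quantities rather than the paper's notation.

% Background sketch: standard single-agent LQR (not the paper's multi-agent model).
% Linear dynamics with i.i.d. zero-mean noise w_t, quadratic stage cost,
% and a linear state-feedback policy parameterized by a gain matrix K.
\[
  x_{t+1} = A x_t + B u_t + w_t,
  \qquad
  u_t = -K x_t,
\]
\[
  J(K) \;=\; \lim_{T \to \infty} \frac{1}{T}\,
  \mathbb{E}\!\left[\sum_{t=0}^{T-1} \left( x_t^\top Q x_t + u_t^\top R u_t \right)\right].
\]
% Policy optimization methods such as actor-critic search over K to minimize J(K);
% although J is nonconvex in K, gradient-based methods are known to converge globally
% for single-agent LQR under standard stabilizability and cost-positivity conditions.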