We study reinforcement learning (RL) in a setting with a network of agents whose states and actions interact locally, and where the objective is to find localized policies that maximize the (discounted) global reward. A fundamental challenge in this setting is that the state-action space size scales exponentially in the number of agents, rendering the problem intractable for large networks. In this paper, we propose a Scalable Actor Critic (SAC) framework that exploits the network structure and finds a localized policy that is an $O(\rho^{\kappa})$-approximation of a stationary point of the objective for some $\rho\in(0,1)$, with complexity that scales with the local state-action space size of the largest $\kappa$-hop neighborhood of the network. We illustrate our model and approach using examples from wireless communication, epidemics, and traffic.
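To make the scaling claim concrete, the following is a minimal sketch (not from the paper) contrasting the global state-action space size with the size of the largest $\kappa$-hop neighborhood, which is what SAC's complexity depends on. It assumes, purely for illustration, a line-graph interaction network and binary local state and action spaces; the helper `khop_neighborhood` is a hypothetical name, not part of any published implementation.

```python
from collections import deque

def khop_neighborhood(adj, i, kappa):
    """Return the set of agents within kappa hops of agent i (plain BFS).

    Hypothetical helper for illustration; adj maps each agent to its
    list of neighbors in the interaction network.
    """
    seen = {i}
    frontier = deque([(i, 0)])
    while frontier:
        node, depth = frontier.popleft()
        if depth == kappa:
            continue
        for j in adj[node]:
            if j not in seen:
                seen.add(j)
                frontier.append((j, depth + 1))
    return seen

# Assumed example: a line graph on n agents, where agent i interacts
# only with agents i-1 and i+1.
n, kappa = 20, 2
adj = {i: [j for j in (i - 1, i + 1) if 0 <= j < n] for i in range(n)}

# With assumed binary local states and actions, each agent contributes
# a factor of |S_i| * |A_i| = 4 to the joint state-action space.
global_size = 4 ** n  # exponential in the number of agents
largest_hood = max(len(khop_neighborhood(adj, i, kappa)) for i in range(n))
local_size = 4 ** largest_hood  # the quantity SAC's complexity scales with

print(f"global state-action space: 4^{n} = {global_size}")
print(f"largest {kappa}-hop neighborhood: {largest_hood} agents "
      f"-> local space 4^{largest_hood} = {local_size}")
```

On this toy network the global space has $4^{20} \approx 1.1 \times 10^{12}$ state-action pairs, while the largest $2$-hop neighborhood contains only $5$ agents, so the local space has $4^{5} = 1024$ pairs; the gap widens as $n$ grows, which is the sense in which complexity is local rather than global.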