Learning in multi-agent systems is highly challenging due to the inherent complexity introduced by agents' interactions. We tackle systems with a huge population of interacting agents (e.g., swarms) via Mean-Field Control (MFC). MFC considers an asymptotically infinite population of identical agents that aim to collaboratively maximize the collective reward. Specifically, we consider the case of unknown system dynamics where the goal is to simultaneously optimize for the rewards and learn from experience. We propose an efficient model-based reinforcement learning algorithm $\text{M}^3\text{-UCRL}$ that runs in episodes and provably solves this problem. $\text{M}^3\text{-UCRL}$ uses upper-confidence bounds to balance exploration and exploitation during policy learning. Our main theoretical contributions are the first general regret bounds for model-based RL for MFC, obtained via a novel mean-field type analysis. $\text{M}^3\text{-UCRL}$ can be instantiated with different models such as neural networks or Gaussian Processes, and effectively combined with neural network policy learning. We empirically demonstrate the convergence of $\text{M}^3\text{-UCRL}$ on the swarm motion problem of controlling an infinite population of agents seeking to maximize location-dependent reward and avoid congested areas.
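To fix ideas, one common way to formalize the episodic mean-field control problem sketched above is the following; the notation ($\mu_t$ for the population's state distribution, $\pi_t$ for the shared policy at step $t$, $f$ for the unknown transition function, $r$ for the per-step reward, and $H$ for the episode horizon) is our own illustrative choice and is not fixed by the abstract:
\begin{align*}
    s_{t+1} &= f(s_t, a_t, \mu_t) + \varepsilon_t, \qquad a_t = \pi_t(s_t, \mu_t),\\
    \mu_{t+1} &= \Phi(\mu_t, \pi_t), \qquad \text{the distribution induced when every agent applies } \pi_t,\\
    \max_{\pi_0, \dots, \pi_{H-1}} &\; \sum_{t=0}^{H-1} r(\mu_t, \pi_t).
\end{align*}
Under this reading, the regret accumulated over episodes compares the value of the policies the learner actually plays against the value of the maximizing policy above, which is the quantity the general regret bounds refer to.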