Reinforcement learning in cooperative multi-agent settings has recently broadened significantly in scope, with applications in cooperative estimation for advertising, dynamic treatment regimes, distributed control, and federated learning. In this paper, we study cooperative multi-agent RL with function approximation, where a group of agents communicate with each other to jointly solve an episodic MDP. We demonstrate that, via careful message-passing and cooperative value iteration, it is possible to achieve near-optimal no-regret learning even with a fixed, constant communication budget. We further demonstrate that, even in heterogeneous cooperative settings, it is possible to achieve Pareto-optimal no-regret learning with limited communication. Our work generalizes several ideas from the multi-agent contextual and multi-armed bandit literature to MDPs and reinforcement learning.