A Policy Gradient Algorithm for Learning to Learn in Multiagent Reinforcement Learning

A fundamental challenge in multiagent reinforcement learning is to learn beneficial behaviors in a shared environment with other agents that are simultaneously learning. In particular, each agent perceives the environment as effectively non-stationary due to the changing policies of the other agents. Moreover, each agent is itself constantly learning, which leads to natural non-stationarity in the distribution of experiences it encounters. In this paper, we propose a novel meta-multiagent policy gradient theorem that directly accounts for the non-stationary policy dynamics inherent to these multiagent settings. This is achieved by modeling our gradient updates to consider both an agent's own non-stationary policy dynamics and the non-stationary policy dynamics of the other agents interacting with it in the environment. We find that our theoretically grounded approach provides a general solution to the multiagent learning problem, which inherently combines key aspects of previous state-of-the-art approaches on this topic. We test our method on several multiagent benchmarks and demonstrate that it adapts to new agents as they learn more efficiently than previous related approaches across the spectrum of mixed-incentive, competitive, and cooperative environments.
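The non-stationarity described above can be illustrated with a minimal sketch. This is not the proposed meta-multiagent policy gradient method; it only shows why a fixed policy sees a moving target when another agent learns. The game (matching pennies), the single-logit Bernoulli policies, the learning rate, and all function names are assumptions chosen for illustration.

```python
import math

# Payoffs for agent i (row player) in matching pennies; agent j (column
# player) receives the negative. Illustrative choice, not from the paper.
PAYOFF_I = [[1.0, -1.0],
            [-1.0, 1.0]]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def expected_return_i(theta_i, theta_j):
    """Expected one-shot payoff of agent i under two Bernoulli policies."""
    p = sigmoid(theta_i)  # probability that agent i plays action 0
    q = sigmoid(theta_j)  # probability that agent j plays action 0
    probs_i = [p, 1.0 - p]
    probs_j = [q, 1.0 - q]
    return sum(probs_i[a] * probs_j[b] * PAYOFF_I[a][b]
               for a in range(2) for b in range(2))


def grad_j(theta_i, theta_j):
    """Exact gradient of agent j's expected return w.r.t. its own logit."""
    p = sigmoid(theta_i)
    q = sigmoid(theta_j)
    # Agent i's return simplifies to (2p - 1)(2q - 1); agent j gets its
    # negative, and dq/dtheta_j = q(1 - q).
    return -(2.0 * p - 1.0) * 2.0 * q * (1.0 - q)


def returns_of_fixed_policy(theta_i=1.0, theta_j=0.5, lr=1.0, steps=5):
    """Hold agent i's policy fixed while agent j does gradient ascent.

    Returns agent i's expected return before each of agent j's updates.
    """
    history = []
    for _ in range(steps):
        history.append(expected_return_i(theta_i, theta_j))
        theta_j += lr * grad_j(theta_i, theta_j)
    return history


if __name__ == "__main__":
    for step, value in enumerate(returns_of_fixed_policy()):
        print(f"step {step}: return of agent i = {value:+.4f}")
```

Although agent i never changes its parameters, its expected return falls at every step because agent j keeps adapting. From agent i's point of view, the environment's reward distribution is drifting, which is the effect the gradient updates in this paper are designed to account for.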
