The main challenge of multiagent reinforcement learning is the difficulty of learning useful policies in the presence of other simultaneously learning agents whose changing behaviors jointly affect the environment's transition and reward dynamics. An effective approach that has recently emerged for addressing this non-stationarity is for each agent to anticipate the learning of other agents and influence the evolution of future policies towards desirable behavior for its own benefit. Unfortunately, previous approaches for achieving this suffer from myopic evaluation, considering only a finite number of policy updates. As such, these methods can only influence transient future policies rather than achieving the promise of scalable equilibrium selection approaches that influence the behavior at convergence. In this paper, we propose a principled framework for considering the limiting policies of other agents as time approaches infinity. Specifically, we develop a new optimization objective that maximizes each agent's average reward by directly accounting for the impact of its behavior on the limiting set of policies that other agents will converge to. Our paper characterizes desirable solution concepts within this problem setting and provides practical approaches for optimizing over possible outcomes. As a result of our farsighted objective, we demonstrate better long-term performance than state-of-the-art baselines across a suite of diverse multiagent benchmark domains.
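As a rough illustration of the farsighted objective described above (a sketch, not the paper's exact formulation), agent $i$'s average-reward objective under the limiting policies of the other agents can be written as follows, where $\pi^i$ is agent $i$'s policy, $r^i_t$ its reward at time $t$, and $\pi^{-i}_\infty(\pi^i)$ denotes the joint policy the other agents converge to given agent $i$'s behavior; this notation is illustrative rather than taken from the paper:

$$\max_{\pi^i} \; \lim_{T \to \infty} \frac{1}{T} \, \mathbb{E}\!\left[ \sum_{t=0}^{T-1} r^i_t \;\middle|\; \pi^i, \, \pi^{-i}_\infty(\pi^i) \right]$$

The key difference from myopic influence methods is that the conditioning is on the converged policies $\pi^{-i}_\infty(\pi^i)$ rather than on policies obtained after a fixed, finite number of updates.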