Reinforcement learning in multiagent systems has been studied in the fields of economic game theory, artificial intelligence, and statistical physics by developing an analytical understanding of the learning dynamics (often in relation to the replicator dynamics of evolutionary game theory). However, the majority of these analytical studies focus on repeated normal-form games, which have only a single environmental state. Environmental dynamics, i.e., changes in the state of the environment that affect the agents' payoffs, have received less attention, and a universal method for obtaining deterministic equations from established multistate reinforcement learning algorithms has been lacking. In this work we present a novel methodological extension, separating the interaction time scale from the adaptation time scale, to derive the deterministic limit of a general class of reinforcement learning algorithms called temporal difference learning. This form of learning is equipped to function in more realistic multistate environments by using the estimated value of future environmental states to adapt the agent's behavior. We demonstrate the potential of our method with three well-established learning algorithms: Q learning, SARSA learning, and Actor-Critic learning. Illustrations of their dynamics on two multiagent, multistate environments reveal a wide range of dynamical regimes, such as convergence to fixed points, limit cycles, and even deterministic chaos.
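To make the separation of interaction and adaptation time scales concrete, the following is a minimal sketch, not the paper's exact formulation: it contrasts standard tabular Q learning, which adapts after every sampled interaction, with a deterministic variant in which the temporal difference error is averaged over the exact transition probabilities before each adaptation step. The environment (P, R), its size, and all parameters are hypothetical and chosen only for illustration.

```python
# Sketch: stochastic vs. deterministic-limit Q-learning updates on a
# hypothetical tabular environment (all quantities are illustrative).
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 2, 2
# Hypothetical transition probabilities P[s, a, s'] and rewards R[s, a].
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))
R = rng.standard_normal((n_states, n_actions))

alpha, gamma = 0.1, 0.9   # learning rate and discount factor

def step(s, a):
    """Sample one interaction with the hypothetical environment."""
    s_next = rng.choice(n_states, p=P[s, a])
    return R[s, a], s_next

# (i) Stochastic Q learning: adapt after every single interaction.
Q = np.zeros((n_states, n_actions))
s = 0
for _ in range(1000):
    a = rng.integers(n_actions)                      # exploratory policy
    r, s_next = step(s, a)
    Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])
    s = s_next

# (ii) Deterministic limit (sketch): average the temporal difference error
# over the exact transition probabilities before each adaptation step, so
# sampling noise vanishes and the update becomes a deterministic map on Q.
Q_det = np.zeros((n_states, n_actions))
for _ in range(1000):
    expected_td = R + gamma * (P @ Q_det.max(axis=1)) - Q_det
    Q_det += alpha * expected_td
```

In this reading, part (ii) plays the role of a deterministic limit: many interactions are effectively averaged before a single adaptation step, which is the time-scale separation the abstract refers to, here shown only for a Q-learning-style value update rather than for the full class of temporal difference algorithms treated in the paper.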