We examine the long-run behavior of multi-agent online learning in games that evolve over time. Specifically, we focus on a wide class of policies based on mirror descent, and we show that the induced sequence of play (a) converges to Nash equilibrium in time-varying games that stabilize in the long run to a strictly monotone limit; and (b) stays asymptotically close to the evolving equilibrium of the sequence of stage games (assuming they are strongly monotone). Our results apply to both gradient-based and payoff-based feedback, i.e., the "bandit feedback" case where players only observe the payoffs of their chosen actions.
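For orientation, a minimal sketch of the standard mirror descent update underlying such policies, written with generic placeholders (step size $\eta_t$, regularizer $h$, feedback estimate $\hat{v}_t$) that are not necessarily the paper's exact formulation: each player plays $x_t$, receives the feedback signal $\hat{v}_t$, and updates
$$x_{t+1} = \operatorname*{arg\,max}_{x \in \mathcal{X}} \left\{ \eta_t \langle \hat{v}_t, x \rangle - D_h(x, x_t) \right\},$$
where $D_h$ denotes the Bregman divergence of $h$ and $\hat{v}_t$ is an estimate of the player's individual payoff gradient, exact under gradient-based feedback and reconstructed from the observed payoff of the chosen action in the bandit case.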