Learning from repeated play in a fixed two-player zero-sum game is a classic problem in game theory and online learning. We consider a variant of this problem where the game payoff matrix changes over time, possibly in an adversarial manner. We first present three performance measures to guide the algorithmic design for this problem: 1) the well-studied individual regret, 2) an extension of the duality gap, and 3) a new measure called dynamic Nash Equilibrium regret, which quantifies the cumulative difference between the player's payoff and the minimax game value. Next, we develop a single parameter-free algorithm that simultaneously enjoys favorable guarantees under all three performance measures. These guarantees are adaptive to different non-stationarity measures of the payoff matrices and, importantly, recover the best known results when the payoff matrix is fixed. Our algorithm is based on a two-layer structure with a meta-algorithm learning over a group of black-box base-learners satisfying a certain property, along with several novel ingredients specifically designed for the time-varying game setting. Empirical results further validate the effectiveness of our algorithm.
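The paper's actual algorithm and its guarantees are more involved than what an abstract can convey; purely as a rough illustration of the two-layer idea and of dynamic Nash Equilibrium regret, the following is a minimal Python sketch in which the base-learners are exponential-weights updates with different step sizes, the meta-algorithm is another exponential-weights update over them, and the minimax game value is computed by a small linear program (via scipy.optimize.linprog). The step-size grid `etas`, the uniform stand-in opponent, and the helper names (`game_value`, `hedge`, `two_layer_play`) are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch only: a Hedge-style meta-learner over exponential-weights
# base-learners, played against a sequence of time-varying payoff matrices,
# while tracking dynamic NE regret = sum_t [ val(A_t) - x_t^T A_t y_t ].
import numpy as np
from scipy.optimize import linprog

def game_value(A):
    """Minimax value of the zero-sum game A (row player maximizes x^T A y)."""
    m, n = A.shape
    c = np.zeros(m + 1); c[-1] = -1.0                       # maximize v  <=>  minimize -v
    A_ub = np.hstack([-A.T, np.ones((n, 1))])                # v - (A^T x)_j <= 0 for all j
    b_ub = np.zeros(n)
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])    # x lies on the simplex
    b_eq = np.array([1.0])
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return -res.fun

def hedge(w, loss, eta):
    """One exponential-weights update; returns the renormalized distribution."""
    w = w * np.exp(-eta * loss)
    return w / w.sum()

def two_layer_play(payoffs, etas=(0.05, 0.1, 0.2, 0.4)):
    """Row player: a meta-Hedge over base-learners, each a Hedge with its own eta."""
    T, (m, n) = len(payoffs), payoffs[0].shape
    base = [np.ones(m) / m for _ in etas]                    # base strategies over actions
    meta = np.ones(len(etas)) / len(etas)                     # meta weights over base-learners
    eta_meta = np.sqrt(np.log(len(etas)) / T)
    dyn_ne_regret = 0.0
    for A in payoffs:
        x = sum(w * p for w, p in zip(meta, base))            # combined row strategy
        y = np.ones(n) / n                                     # stand-in opponent strategy
        grad = A @ y                                           # payoff of each row action
        dyn_ne_regret += game_value(A) - x @ grad              # gap to the minimax value
        meta = hedge(meta, -np.array([p @ grad for p in base]), eta_meta)
        base = [hedge(p, -grad, eta) for p, eta in zip(base, etas)]
    return dyn_ne_regret

# Example: payoff matrices drifting over time.
rng = np.random.default_rng(0)
mats = [rng.uniform(-1, 1, size=(3, 3)) for _ in range(50)]
print(two_layer_play(mats))
```

Running multiple base-learners with different step sizes and letting a meta-learner weight them by realized payoff is one standard way to obtain parameter-free, adaptivity-style guarantees; the paper's construction refines this template with ingredients specific to time-varying games.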