In game-theoretic learning, several agents are simultaneously following their individual interests, so the environment is non-stationary from each player's perspective. In this context, the performance of a learning algorithm is often measured by its regret. However, no-regret algorithms are not created equal in terms of game-theoretic guarantees: depending on how they are tuned, some of them may drive the system to an equilibrium, while others could produce cyclic, chaotic, or otherwise divergent trajectories. To account for this, we propose a range of no-regret policies based on optimistic mirror descent, with the following desirable properties: i) they do not require any prior tuning or knowledge of the game; ii) they all achieve O(\sqrt{T}) regret against arbitrary, adversarial opponents; and iii) they converge to the best response against convergent opponents. Also, if employed by all players, then iv) they guarantee O(1) social regret; while v) the induced sequence of play converges to Nash equilibrium with O(1) individual regret in all variationally stable games (a class of games that includes all monotone and convex-concave zero-sum games).
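To make the underlying method concrete, below is a minimal sketch of optimistic mirror descent with a Euclidean mirror map over the probability simplex, using an illustrative adaptive step size driven by the accumulated gradient variation. The names (`OptimisticMD`, `project_simplex`), the specific step-size rule, and the toy zero-sum example are placeholders of ours for illustration only; they are not the exact adaptive policies analysed in the paper.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of a vector v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    idx = np.arange(1, len(v) + 1)
    rho = np.nonzero(u * idx > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

class OptimisticMD:
    """Sketch of optimistic mirror descent (Euclidean mirror map, simplex constraints).

    The step-size rule below (inverse square root of accumulated prediction error)
    is only an assumption for illustration, not the paper's exact adaptive policy.
    """
    def __init__(self, dim):
        self.base = np.full(dim, 1.0 / dim)   # anchor state y_t (uniform start)
        self.last_grad = np.zeros(dim)        # gradient prediction M_t (= last observed gradient)
        self.var = 0.0                        # accumulated prediction error sum ||g_t - M_t||^2

    def _eta(self):
        # Illustrative adaptive step size: shrinks with accumulated gradient variation.
        return 1.0 / np.sqrt(1.0 + self.var)

    def play(self):
        # "Optimistic" leading step: extrapolate from the anchor using the prediction.
        return project_simplex(self.base - self._eta() * self.last_grad)

    def update(self, grad):
        # Anchor update with the actual loss gradient, then refresh the prediction.
        eta = self._eta()
        self.base = project_simplex(self.base - eta * grad)
        self.var += float(np.sum((grad - self.last_grad) ** 2))
        self.last_grad = grad

# Example usage: two players running the sketch on matching pennies (a monotone,
# convex-concave zero-sum game); their play should drift toward the uniform equilibrium.
A = np.array([[1.0, -1.0], [-1.0, 1.0]])
p1, p2 = OptimisticMD(2), OptimisticMD(2)
for _ in range(1000):
    x, y = p1.play(), p2.play()
    p1.update(A @ y)       # player 1 minimises x^T A y
    p2.update(-A.T @ x)    # player 2 minimises -x^T A y
```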