An ideal strategy in zero-sum games should not only grant the player an average reward no less than the value of the Nash equilibrium, but also exploit the (adaptive) opponents when they are suboptimal. While most existing works on Markov games focus exclusively on the former objective, it remains open whether we can achieve both objectives simultaneously. To address this problem, this work studies no-regret learning in Markov games with adversarial opponents, where regret is measured against the best fixed policy in hindsight. Along this direction, we present a new, complete set of positive and negative results: When the policies of the opponents are revealed at the end of each episode, we propose new efficient algorithms achieving $\sqrt{K}$-regret bounds (where $K$ is the number of episodes) when either (1) the baseline policy class is small or (2) the opponent's policy class is small. This is complemented with an exponential lower bound when neither condition holds. When the policies of the opponents are not revealed, we prove a statistical hardness result even in the most favorable scenario where both of the above conditions hold. Our hardness result is much stronger than existing hardness results, which either involve only computational hardness or require further restrictions on the algorithms.
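For concreteness, a hedged sketch of the regret notion in question; the notation $V^{\mu,\nu}$, $\mu_k$, $\nu_k$, and $\Pi$ is illustrative and not taken verbatim from the abstract. Letting $V^{\mu,\nu}$ denote the player's expected return when playing policy $\mu$ against opponent policy $\nu$, $\mu_k$ and $\nu_k$ the policies used by the player and the opponent in episode $k$, and $\Pi$ the baseline policy class, the regret against the best fixed policy in hindsight can be written as
$$\mathrm{Regret}(K) \;=\; \max_{\mu^\dagger \in \Pi} \sum_{k=1}^{K} V^{\mu^\dagger,\,\nu_k} \;-\; \sum_{k=1}^{K} V^{\mu_k,\,\nu_k},$$
so a $\sqrt{K}$-regret bound means this quantity grows as $O(\sqrt{K})$ up to problem-dependent factors, i.e., the average per-episode gap to the best fixed policy vanishes as $K$ grows.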