A Generalized Minimax Q-learning Algorithm for Two-Player Zero-Sum Stochastic Games

We consider the problem of two-player zero-sum games, which is formulated in the literature as a min-max Markov game. The solution of this game starting from a given state, namely the min-max payoff, is called the min-max value of the state. In this work, we compute the solution of the two-player zero-sum game using the technique of successive relaxation, which has been successfully applied in the literature to obtain a faster value iteration algorithm for Markov Decision Processes. We extend the concept of successive relaxation to the setting of two-player zero-sum games and show that, under a special structure on the game, this technique facilitates faster computation of the min-max value of the states. We then derive a generalized minimax Q-learning algorithm that computes the optimal policy when the model information is not known. Finally, we prove the convergence of the proposed generalized minimax Q-learning algorithm using stochastic approximation techniques, under an assumption on the boundedness of the iterates. Through experiments, we demonstrate the effectiveness of the proposed algorithm.
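
To make the successive relaxation idea concrete, the following is a minimal sketch of relaxed min-max value iteration on a toy game. It is an illustration under stated assumptions, not the algorithm as specified in the paper. The assumptions are: the "special structure" is taken to mean that every state has a positive self-transition probability under all action pairs; the relaxation parameter is bounded by w* = 1 / (1 - gamma * min p(i|i,u,v)), as in the successive relaxation literature for Markov Decision Processes; each player has exactly two actions, so the matrix-game value has a closed form; and the rewards `R`, transitions `P`, and all function names are hypothetical.

```python
def matrix_game_value(m):
    """Value of a 2x2 zero-sum matrix game; the row player maximizes."""
    (a, b), (c, d) = m
    maximin = max(min(a, b), min(c, d))
    minimax = min(max(a, c), max(b, d))
    if maximin == minimax:
        return maximin  # pure-strategy saddle point
    return (a * d - b * c) / (a + d - b - c)  # mixed-strategy value


def relaxed_value_iteration(r, p, gamma, w, tol=1e-10, max_iter=10_000):
    """Min-max value iteration with relaxation parameter w (w = 1 is standard).

    r[i][u][a] is the reward and p[i][u][a][j] the transition probability
    for state i, maximizer action u, minimizer action a, next state j.
    Returns the value estimate and the number of sweeps performed.
    """
    n = len(r)
    v = [0.0] * n
    for k in range(1, max_iter + 1):
        new_v = []
        for i in range(n):
            # Relaxed payoff matrix: w * (one-step lookahead) + (1 - w) * v[i].
            m = [[w * (r[i][u][a]
                       + gamma * sum(p[i][u][a][j] * v[j] for j in range(n)))
                  + (1.0 - w) * v[i]
                  for a in range(2)]
                 for u in range(2)]
            new_v.append(matrix_game_value(m))
        diff = max(abs(x - y) for x, y in zip(new_v, v))
        v = new_v
        if diff < tol:
            return v, k
    return v, max_iter


# Hypothetical toy game: 2 states, 2 actions per player, self-loop prob >= 0.5.
R = [[[3.0, 1.0], [2.0, 4.0]],
     [[1.0, 2.0], [3.0, 1.0]]]
P = [[[[0.5, 0.5], [0.7, 0.3]], [[0.6, 0.4], [0.5, 0.5]]],
     [[[0.4, 0.6], [0.5, 0.5]], [[0.3, 0.7], [0.5, 0.5]]]]
GAMMA = 0.9

# Assumed upper bound on the relaxation parameter (see the lead-in).
p_min = min(P[i][u][a][i] for i in range(2) for u in range(2) for a in range(2))
W_STAR = 1.0 / (1.0 - GAMMA * p_min)

v_std, k_std = relaxed_value_iteration(R, P, GAMMA, w=1.0)
v_sor, k_sor = relaxed_value_iteration(R, P, GAMMA, w=W_STAR)
print("standard:", v_std, "sweeps:", k_std)
print("relaxed: ", v_sor, "sweeps:", k_sor)
```

In this value-iteration form the relaxed update has the same fixed point as the standard one, because the value of a matrix game scales with a positive factor and shifts with an added constant; only the number of sweeps changes. For games with more than two actions per player, the closed-form `matrix_game_value` would have to be replaced by a linear program.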
