As one of the most popular methods in reinforcement learning, Q-learning has received increasing attention, and recent theoretical work has established regret bounds for algorithms in the Q-learning family across a variety of settings. In this paper, we analyze the cumulative regret of the Nash Q-learning algorithm on two-player turn-based stochastic Markov games (2-TBSG) and propose the first gap-dependent logarithmic upper bounds in the episodic tabular setting. These bounds match the theoretical lower bound up to a logarithmic factor. Furthermore, we extend this conclusion to the infinite-horizon discounted setting and derive a similar gap-dependent logarithmic regret bound. Finally, under the linear MDP assumption, we obtain another logarithmic regret bound for 2-TBSG, in both the centralized and independent settings.
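For context, a standard way to define the cumulative regret analyzed in such episodic settings is sketched below; the notation ($K$, $s_1^k$, $V_1^*$, $\pi^k$) is generic and is an assumption here, not necessarily the exact convention used in this paper.
\[
\mathrm{Regret}(K) \;=\; \sum_{k=1}^{K} \Bigl( V^{*}_{1}(s^{k}_{1}) \;-\; V^{\pi^{k}}_{1}(s^{k}_{1}) \Bigr),
\]
where $K$ is the number of episodes, $s^{k}_{1}$ is the initial state of episode $k$, $V^{*}_{1}$ denotes the (Nash) value function of the game at the first step, and $\pi^{k}$ is the policy pair executed by the algorithm in episode $k$. A gap-dependent logarithmic bound states that this quantity grows only logarithmically in $K$, with constants depending on the suboptimality gaps of the game.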