Recent extensions of the well-known fictitious play learning procedure from static games to dynamic games have been proved to converge globally to stationary Nash equilibria in two important classes of dynamic games: zero-sum and identical-interest discounted stochastic games. However, these decentralized algorithms require the players to know the model exactly (the transition probabilities and their payoffs at every stage). To overcome these strong assumptions, our paper introduces regularizations of the systems in (Leslie 2020; Baudin 2022) to construct a family of new decentralized learning algorithms that are model-free (players do not know the transitions, and their payoffs are perturbed at every stage). Our procedures can be seen as extensions to stochastic games of the classical smooth fictitious play learning procedures in static games (where the players' best responses are regularized through a smooth, strictly concave perturbation of their payoff functions). We prove the convergence of our family of procedures to stationary regularized Nash equilibria in zero-sum and identical-interest discounted stochastic games. The proof uses the continuous-time smooth best-response dynamics counterparts together with stochastic approximation methods. When there is only one player, our problem is an instance of reinforcement learning, and our procedures are proved to converge globally to the optimal stationary policy of the regularized MDP. In that sense, they can be seen as an alternative to the well-known Q-learning procedure.
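As an illustration of the kind of smooth perturbation involved (a standard example; the paper's family allows any smooth, strictly concave perturbation), the classical entropy perturbation of the stage payoff yields the logit (softmax) smooth best response. For a payoff vector $u$ over a finite action set $A$ and a regularization weight $\varepsilon > 0$,
\[
  \beta_\varepsilon(u)
  \;=\; \operatorname*{arg\,max}_{p \in \Delta(A)}
        \Big\{ \langle p, u \rangle \;-\; \varepsilon \sum_{a \in A} p(a)\log p(a) \Big\}
  \;=\; \left( \frac{e^{u(a)/\varepsilon}}{\sum_{b \in A} e^{u(b)/\varepsilon}} \right)_{a \in A},
\]
which replaces the exact (possibly discontinuous) best response of standard fictitious play by a single-valued, Lipschitz map, and whose fixed points are the regularized Nash equilibria referred to above.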