We present fictitious play dynamics for stochastic games and analyze their convergence properties in zero-sum stochastic games. In our dynamics, players form beliefs about the opponent's strategy and about their own continuation payoff (Q-function), and play a greedy best response based on the estimated continuation payoffs. Players update their beliefs from observations of opponent actions. A key property of the learning dynamics is that the beliefs on Q-functions are updated at a slower timescale than the beliefs on strategies. We show that, in both the model-based and the model-free case (without knowledge of player payoff functions and state transition probabilities), the beliefs on strategies converge to a stationary mixed Nash equilibrium of the zero-sum stochastic game.
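The two-timescale structure described above can be sketched as follows. This is a minimal illustration, not the paper's exact algorithm: the game data (payoff matrices, transition kernel), the stepsize schedules, and the synchronous per-state updates are all assumptions made for the sketch. Strategy beliefs move with a fast stepsize `alpha`, while the Q-function estimate moves with a slower stepsize `beta`; each player best-responds greedily against its current beliefs and continuation-payoff estimates.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-state, 2-action zero-sum stochastic game (illustrative data):
# payoff[s, a1, a2] is player 1's payoff; P[s, a1, a2] is the next-state
# distribution after the joint action (a1, a2) in state s.
n_states, n_actions = 2, 2
payoff = rng.uniform(-1, 1, size=(n_states, n_actions, n_actions))
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions, n_actions))
gamma = 0.9  # discount factor

# Beliefs: pi1[s] is player 1's belief about player 2's mixed strategy in
# state s; pi2[s] is player 2's belief about player 1. Q is the shared
# estimated continuation-payoff (Q-function) table from player 1's view.
pi1 = np.full((n_states, n_actions), 1.0 / n_actions)
pi2 = np.full((n_states, n_actions), 1.0 / n_actions)
Q = np.zeros((n_states, n_actions, n_actions))

for t in range(1, 5001):
    alpha = 1.0 / (t + 1)                    # fast timescale: strategy beliefs
    beta = 1.0 / ((t + 1) * np.log(t + 2))   # slow timescale: Q beliefs
    for s in range(n_states):
        # Greedy best responses against current beliefs and Q estimates:
        # player 1 maximizes, player 2 minimizes the zero-sum payoff.
        a1 = int(np.argmax(Q[s] @ pi1[s]))
        a2 = int(np.argmin(pi2[s] @ Q[s]))
        # Fast update: empirical beliefs track observed opponent actions.
        pi1[s] = (1 - alpha) * pi1[s]; pi1[s, a2] += alpha
        pi2[s] = (1 - alpha) * pi2[s]; pi2[s, a1] += alpha
    # Slow update: move Q toward stage payoff plus discounted continuation
    # value, where the value of each state is evaluated under the beliefs.
    v = np.einsum('sa,sab,sb->s', pi2, Q, pi1)
    Q += beta * (payoff + gamma * (P @ v) - Q)
```

Because `beta` decays faster than `alpha`, the Q estimates see the strategy beliefs as quasi-static, which is the timescale separation the convergence analysis relies on.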