This paper combines ideas from Q-learning and fictitious play to define three reinforcement learning procedures that converge to the set of stationary mixed Nash equilibria in identical-interest discounted stochastic games. First, we analyse three continuous-time systems that generalize the best-response dynamics defined by Leslie et al. for zero-sum discounted stochastic games. Under assumptions that depend on the system, the dynamics are shown to converge to the set of stationary equilibria in identical-interest discounted stochastic games. We then introduce three analogous discrete-time procedures in the spirit of Sayin et al. and demonstrate their convergence to the set of stationary equilibria by combining our continuous-time results with stochastic approximation techniques. Numerical experiments complement our theoretical findings.
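As a rough illustration of the kind of procedure the abstract describes, the following is a minimal sketch, not the authors' algorithms: a toy two-player identical-interest stochastic game in which each player couples a per-state fictitious-play belief about the opponent's strategy with a Q-learning update of a joint-action value table, and plays a best response against its belief. The tabular setup, learning-rate schedule, and all variable names are assumptions made for this example only.

```python
# Illustrative sketch only (assumed setup, not the paper's exact procedures):
# Q-learning + fictitious play in a toy identical-interest stochastic game.
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions = 2, 2
gamma = 0.9  # discount factor (assumed)

# Shared (identical-interest) reward r[s, a0, a1] and transitions P[s, a0, a1, s'].
r = rng.uniform(0, 1, size=(n_states, n_actions, n_actions))
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions, n_actions))

# Each player keeps a joint-action Q-table and an empirical belief (fictitious play)
# about the opponent's per-state action frequencies.
Q = [np.zeros((n_states, n_actions, n_actions)) for _ in range(2)]
counts = [np.ones((n_states, n_actions)) for _ in range(2)]
belief = [c / c.sum(axis=1, keepdims=True) for c in counts]

def expected_q(i, s):
    """Player i's Q-values in state s, averaged over its belief about the opponent."""
    return Q[i][s] @ belief[i][s] if i == 0 else Q[i][s].T @ belief[i][s]

def best_response(i, s):
    return int(np.argmax(expected_q(i, s)))

s = 0
for t in range(1, 20001):
    a = [best_response(0, s), best_response(1, s)]
    s_next = rng.choice(n_states, p=P[s, a[0], a[1]])
    reward = r[s, a[0], a[1]]
    alpha = 1.0 / (1 + 0.01 * t)  # learning-rate schedule (assumed)
    for i in range(2):
        opp = 1 - i
        # Fictitious-play step: update the empirical belief about the opponent.
        counts[i][s, a[opp]] += 1
        belief[i][s] = counts[i][s] / counts[i][s].sum()
        # Q-learning step: continuation value is the best response against beliefs.
        v_next = expected_q(i, s_next).max()
        Q[i][s, a[0], a[1]] += alpha * (reward + gamma * v_next - Q[i][s, a[0], a[1]])
    s = s_next

print("Player 0 greedy action per state:", [best_response(0, st) for st in range(n_states)])
print("Player 1 greedy action per state:", [best_response(1, st) for st in range(n_states)])
```

In this hypothetical setup the two learners track each other's empirical play state by state, so their greedy strategies tend to settle on a pure joint action that is a stationary equilibrium of the toy game; the paper's actual procedures and convergence guarantees differ and are developed in continuous and discrete time with stochastic approximation arguments.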