We study two-player zero-sum stochastic games and propose a form of independent learning dynamics called Doubly Smoothed Best-Response dynamics, which integrates a discrete and doubly smoothed variant of best-response dynamics into temporal-difference (TD) learning and minimax value iteration. The resulting dynamics are payoff-based, convergent, rational, and symmetric between players. Our main results provide finite-sample guarantees. In particular, we prove the first known $\tilde{\mathcal{O}}(1/\epsilon^2)$ sample complexity bound for payoff-based independent learning dynamics, up to a smoothing bias. In the special case where the stochastic game has only one state (i.e., matrix games), we provide a sharper $\tilde{\mathcal{O}}(1/\epsilon)$ sample complexity bound. Our analysis uses a novel coupled Lyapunov drift approach to capture the evolution of multiple sets of coupled and stochastic iterates, which might be of independent interest.
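To give a concrete feel for the smoothed best-response idea (before the doubly smoothed, payoff-based version developed in the paper), the following is a minimal sketch of smoothed best-response dynamics in a single-state zero-sum game (matching pennies). It is an illustration only, not the paper's algorithm: the temperature `tau`, step size `alpha`, and iteration count are arbitrary choices, and each player responds with a logit (softmax) best response to the opponent's current mixed strategy.

```python
import numpy as np

def softmax(x, tau):
    """Smoothed (logit) best response with temperature tau."""
    z = np.exp((x - x.max()) / tau)  # subtract max for numerical stability
    return z / z.sum()

# Matching pennies: player 1's payoff matrix; player 2 receives -A.
A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])

tau = 0.5     # smoothing temperature (the source of the smoothing bias)
alpha = 0.05  # constant step size
p1 = np.array([0.9, 0.1])  # player 1's mixed strategy
p2 = np.array([0.2, 0.8])  # player 2's mixed strategy

for _ in range(500):
    # Each player moves a small step toward a smoothed best response
    # to the opponent's current strategy (symmetric update rule).
    br1 = softmax(A @ p2, tau)
    br2 = softmax(-A.T @ p1, tau)
    p1 = (1 - alpha) * p1 + alpha * br1
    p2 = (1 - alpha) * p2 + alpha * br2
```

In this symmetric game the smoothed equilibrium coincides with the Nash equilibrium $(1/2, 1/2)$, so both strategies converge near it; in general, smoothing shifts the limit point away from the Nash equilibrium, which is the bias the abstract's finite-sample guarantees account for.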