Mean-field games have been used in the literature as a theoretical tool to obtain an approximate Nash equilibrium for symmetric and anonymous $N$-player games. However, existing theoretical results assume variations of a "population generative model", which allows arbitrary modifications of the population distribution by the learning algorithm, limiting their applicability. Instead, we show that $N$ agents running policy mirror ascent converge to the Nash equilibrium of the regularized game within $\tilde{\mathcal{O}}(\varepsilon^{-2})$ samples from a single sample trajectory, without a population generative model, up to a standard $\mathcal{O}(\frac{1}{\sqrt{N}})$ error due to the mean field. Departing from the existing literature, instead of working with the best-response map, we first show that a policy mirror ascent map can be used to construct a contractive operator having the Nash equilibrium as its fixed point. Next, we prove that conditional TD-learning in $N$-agent games can learn value functions within $\tilde{\mathcal{O}}(\varepsilon^{-2})$ time steps. These results allow us to prove sample complexity guarantees in the oracle-free setting, relying only on a sample path from the $N$-agent simulator. Furthermore, we demonstrate that our methodology allows for independent learning by $N$ agents with finite-sample guarantees.
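For reference, one common form of an entropy-regularized policy mirror ascent step is the KL-proximal update sketched below; the specific regularizer, step size $\eta$, and temperature $\tau$ are illustrative assumptions and not details stated in the abstract:
\[
\pi_{k+1}(a \mid s) \;\propto\; \pi_k(a \mid s)^{\,1-\eta\tau} \, \exp\!\big(\eta\, \hat{Q}^{\pi_k}(s,a)\big),
\]
where $\hat{Q}^{\pi_k}$ denotes a value estimate, here assumed to be obtained from (conditional) TD-learning along the single sample trajectory.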