Recent techniques for approximating Nash equilibria in very large games leverage neural networks to learn approximately optimal policies (strategies). One promising line of research uses neural networks to approximate counterfactual regret minimization (CFR) or its modern variants. DREAM, the only current CFR-based neural method that is model-free and therefore scalable to very large games, trains a neural network on an estimated regret target that can have extremely high variance due to an importance-sampling term inherited from Monte Carlo CFR (MCCFR). In this paper we propose an unbiased model-free method that does not require any importance sampling. Our method, ESCHER, is principled and is guaranteed to converge to an approximate Nash equilibrium with high probability in the tabular case. We show that the variance of the estimated regret of a tabular version of ESCHER with an oracle value function is significantly lower than that of outcome-sampling MCCFR and of tabular DREAM with an oracle value function. We then show that a deep learning version of ESCHER outperforms the prior state of the art -- DREAM and neural fictitious self-play (NFSP) -- and that the difference becomes dramatic as game size increases.
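For context, a minimal sketch of the outcome-sampling MCCFR estimator whose importance-sampling term the abstract refers to; the notation below ($q(z)$, $z[I]$, $\pi^{\sigma}_{-i}$, $\pi^{\sigma}$, $u_i$) follows the standard MCCFR literature and is assumed here rather than taken from this section. When a single terminal history $z$ is sampled with probability $q(z)$, the sampled counterfactual value of an information set $I$ on the path to $z$ (with $z[I]$ the prefix of $z$ in $I$) is

$$\tilde{v}^{\sigma}_i(I \mid z) \;=\; \frac{1}{q(z)}\,\pi^{\sigma}_{-i}(z[I])\,\pi^{\sigma}(z[I], z)\,u_i(z),$$

and sampled regrets are differences of such values. The $1/q(z)$ factor is the importance-sampling term in question: when the sampled trajectory has low reach probability, the estimate can become arbitrarily large, which is the source of variance that ESCHER is designed to avoid.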