The combination of deep reinforcement learning and search at both training and test time is a powerful paradigm that has led to a number of successes in single-agent settings and perfect-information games, best exemplified by AlphaZero. However, prior algorithms of this form cannot cope with imperfect-information games. This paper presents ReBeL, a general framework for self-play reinforcement learning and search that provably converges to a Nash equilibrium in any two-player zero-sum game. In the simpler setting of perfect-information games, ReBeL reduces to an algorithm similar to AlphaZero. Results in two different imperfect-information games show ReBeL converges to an approximate Nash equilibrium. We also show ReBeL achieves superhuman performance in heads-up no-limit Texas hold'em poker, while using far less domain knowledge than any prior poker AI.