AlphaGo, AlphaGo Zero, and all of their derivatives can play with superhuman strength because they are able to predict the win-lose outcome with great accuracy. However, Go as a game is decided by a final score difference, and in final positions AlphaGo plays suboptimal moves: this is not surprising, since AlphaGo is completely unaware of the final score difference, all winning final positions being equivalent from the winrate perspective. This can be an issue, for instance when trying to learn the "best" move or to play with an initial handicap. Moreover, there is the theoretical quest for the "perfect game", that is, the minimax solution. Thus, a natural question arises: is it possible to train a successful Reinforcement Learning agent to predict score differences instead of winrates? No empirical or theoretical evidence can be found in the literature to support the folklore statement that "this does not work". In this paper we present Leela Zero Score, a software designed to support or disprove the "does not work" statement. Leela Zero Score is built on the open-source solution known as Leela Zero, and is trained on a 9x9 board to predict score differences instead of winrates. We find that the training produces a rational player, and we analyze its style against a strong amateur human player, finding that it is prone to some mistakes when the outcome is close. We compare its strength against SAI, an AlphaGo Zero-like software working on the 9x9 board, and find that the training of Leela Zero Score has reached a premature convergence to a player weaker than SAI.