The AlphaZero algorithm and its successor MuZero have revolutionised several competitive strategy games, including chess, Go, and shogi, as well as video games such as Atari, by learning to play these games better than any human and any specialised computer program. Aside from knowing the rules, AlphaZero had no prior knowledge of each game. This dramatically advanced progress on a long-standing AI challenge: to create programs that can learn for themselves from first principles. Theoretically, there are well-known limits to the power of deep learning for strategy games like chess, Go, and shogi, as they are known to be NEXPTIME-hard. Some papers have argued that the AlphaZero methodology has limitations and is unsuitable for general AI. However, none of these works has suggested specific limits for any particular game. In this paper, we identify stronger bottlenecks than previously suggested. We present the first concrete example of a game, namely the (children's) game of nim, along with other impartial games, that seems to be a stumbling block for AlphaZero and similar reinforcement learning algorithms. We show experimentally that the bottlenecks apply to both the policy and value networks. Since nim can be solved in linear time using logarithmic space, i.e., it has very low complexity, our experimental results supersede known theoretical limits based on the PSPACE (and NEXPTIME) completeness of many games. We show that nim can be learned on small boards, but as the board size increases, AlphaZero-style algorithms rapidly fail to improve. We quantify the difficulties for various setups, parameter settings, and computational resources. Our results might help expand the AlphaZero self-play paradigm by allowing it to use meta-actions during training and/or actual game play, such as applying abstract transformations or reading from and writing to an external memory.
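To illustrate why nim has such low complexity, here is a minimal sketch of the classical nim-sum (bitwise XOR) solution. It assumes normal play, where the player who takes the last object wins. The function names `nim_sum` and `winning_move` are illustrative only and are not taken from our experimental code.

```python
from functools import reduce
from operator import xor
from typing import Optional, Sequence, Tuple


def nim_sum(heaps: Sequence[int]) -> int:
    """XOR of all heap sizes, computed in a single pass over the heaps."""
    return reduce(xor, heaps, 0)


def winning_move(heaps: Sequence[int]) -> Optional[Tuple[int, int]]:
    """Return (heap_index, objects_to_remove) for a winning move.

    Assumes normal play (taking the last object wins).
    Returns None when the nim-sum is zero, i.e. the player to move
    loses against optimal play.
    """
    s = nim_sum(heaps)
    if s == 0:
        return None
    for i, h in enumerate(heaps):
        target = h ^ s  # heap size that would make the overall nim-sum zero
        if target < h:  # only legal if it actually removes objects
            return i, h - target
    return None  # unreachable when s != 0; kept for type completeness


if __name__ == "__main__":
    print(winning_move([3, 4, 5]))
    print(winning_move([1, 2, 3]))
```

The only state carried across the pass is the running XOR, which is where the small space requirement comes from. A position is lost for the player to move exactly when the nim-sum is zero.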