Recently, the seminal algorithms AlphaGo and AlphaZero have started a new era in game learning and deep reinforcement learning. While the achievements of AlphaGo and AlphaZero - playing Go and other complex games at superhuman level - are truly impressive, these architectures have the drawback that they are very complex and require high computational resources. Many researchers are looking for methods that are similar to AlphaZero but have lower computational demands and are thus more easily reproducible. In this paper, we pick an important element of AlphaZero - the Monte Carlo Tree Search (MCTS) planning stage - and combine it with reinforcement learning (RL) agents. We wrap MCTS for the first time around RL n-tuple networks to create versatile agents while keeping computational demands low. We apply this new architecture to several complex games (Othello, ConnectFour, Rubik's Cube) and show the advantages achieved with this AlphaZero-inspired MCTS wrapper. In particular, we present results showing that this AlphaZero-inspired agent is the first one trained on standard hardware (no GPU or TPU) to beat the very strong Othello program Edax up to and including level 7 (where most other algorithms could only defeat Edax up to level 2).
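Since the abstract describes the wrapper idea only at a high level, the following minimal Python sketch may help illustrate it: an MCTS planning loop placed around an already trained agent (e.g., an RL n-tuple network) whose learned value estimate scores leaf positions instead of random rollouts. The state interface (legal_actions, apply, is_terminal, reward) and the agent's value(state) method are hypothetical assumptions for illustration, not the paper's actual API.

```python
# Hedged sketch of an AlphaZero-inspired MCTS wrapper around a trained agent.
# Assumptions (not from the paper): states expose legal_actions(), apply(a),
# is_terminal(), reward(); the wrapped agent exposes value(state) in [-1, 1].
import math
import random


class Node:
    def __init__(self, state, parent=None, action=None):
        self.state = state
        self.parent = parent
        self.action = action          # action that led to this node
        self.children = []
        self.visits = 0
        self.value_sum = 0.0

    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0

    def ucb(self, c):
        # plain UCT score; AlphaZero proper uses PUCT with policy priors
        bonus = c * math.sqrt(math.log(self.parent.visits + 1) / (self.visits + 1))
        return self.q() + bonus


class MCTSWrapper:
    """Wraps a trained agent's value function inside an MCTS planning loop."""

    def __init__(self, agent, iterations=200, c=1.4):
        self.agent = agent            # e.g. a trained n-tuple network agent
        self.iterations = iterations  # tree-search iterations per move
        self.c = c                    # exploration constant

    def act(self, root_state):
        root = Node(root_state)
        for _ in range(self.iterations):
            leaf = self._select(root)
            value = self._evaluate(leaf)
            self._backpropagate(leaf, value)
        # play the most visited child, as in AlphaZero's final move selection
        best = max(root.children, key=lambda n: n.visits)
        return best.action

    def _select(self, node):
        while not node.state.is_terminal():
            if len(node.children) < len(node.state.legal_actions()):
                return self._expand(node)
            node = max(node.children, key=lambda n: n.ucb(self.c))
        return node

    def _expand(self, node):
        tried = {child.action for child in node.children}
        action = random.choice(
            [a for a in node.state.legal_actions() if a not in tried])
        child = Node(node.state.apply(action), parent=node, action=action)
        node.children.append(child)
        return child

    def _evaluate(self, node):
        if node.state.is_terminal():
            return node.state.reward()
        # key difference to plain MCTS: no random rollout, the wrapped
        # agent's learned value estimate scores the leaf directly
        return self.agent.value(node.state)

    def _backpropagate(self, node, value):
        while node is not None:
            node.visits += 1
            node.value_sum += value
            value = -value            # two-player, zero-sum perspective flip
            node = node.parent
```

In this sketch the training of the n-tuple network happens beforehand; the wrapper is only used at play time, which is one way to keep the computational demands far below those of full AlphaZero self-play training.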