Recently, the seminal algorithms AlphaGo and AlphaZero have started a new era in game learning and deep reinforcement learning. While the achievements of AlphaGo and AlphaZero - playing Go and other complex games at superhuman level - are truly impressive, these architectures have the drawback of requiring high computational resources. Many researchers are looking for methods that are similar to AlphaZero but have lower computational demands and are thus more easily reproducible. In this paper, we pick an important element of AlphaZero - the Monte Carlo Tree Search (MCTS) planning stage - and combine it with temporal difference (TD) learning agents. We wrap MCTS, for the first time, around TD n-tuple networks and use this wrapping only at test time, creating versatile agents while keeping computational demands low. We apply this new architecture to several complex games (Othello, ConnectFour, Rubik's Cube) and show the advantages achieved with this AlphaZero-inspired MCTS wrapper. In particular, we present results showing that this agent is the first one trained on standard hardware (no GPU or TPU) to beat the very strong Othello program Edax up to and including level 7 (where most other learning-from-scratch algorithms could defeat Edax only up to level 2).
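To make the test-time wrapping concrete, the sketch below shows, in Python, one plausible way to place an AlphaZero-style MCTS (PUCT) layer around an already trained agent whose value and prior estimates stand in for a TD n-tuple network. This is a hypothetical illustration under assumed interfaces (`state.legal actions via a priors dict`, `state.step`, `state.is_terminal`, `state.terminal_value`, `agent.predict`); it is not the authors' implementation, and all names are placeholders.

```python
import math

# Assumed (hypothetical) interfaces, not taken from the paper:
#   state.step(a)          -> successor state after action a
#   state.is_terminal()    -> bool
#   state.terminal_value() -> outcome from the view of the player to move
#   agent.predict(state)   -> (dict action -> prior prob, float value estimate)

class Node:
    def __init__(self, state, prior=0.0):
        self.state = state
        self.prior = prior      # prior probability P(s, a) from the wrapped agent
        self.children = {}      # action -> Node
        self.visits = 0
        self.value_sum = 0.0

    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0


def puct_score(parent, child, c_puct=1.5):
    # AlphaZero-style PUCT: exploit mean value Q, explore via prior and visit counts
    u = c_puct * child.prior * math.sqrt(parent.visits) / (1 + child.visits)
    return child.q() + u


def expand(node, agent):
    # The wrapped (e.g. TD-trained) agent supplies priors and a leaf value estimate
    priors, value = agent.predict(node.state)
    for action, p in priors.items():
        node.children[action] = Node(node.state.step(action), prior=p)
    return value


def mcts_search(root_state, agent, num_simulations=100):
    """Test-time move selection: MCTS wrapped around a trained agent."""
    root = Node(root_state)
    expand(root, agent)
    for _ in range(num_simulations):
        node, path = root, [root]
        # 1) Selection: descend the tree via PUCT until a leaf is reached
        while node.children:
            action, node = max(node.children.items(),
                               key=lambda kv: puct_score(path[-1], kv[1]))
            path.append(node)
        # 2) Expansion/evaluation by the wrapped agent (or exact terminal value)
        if node.state.is_terminal():
            value = node.state.terminal_value()
        else:
            value = expand(node, agent)
        # 3) Backup: alternate sign for two-player zero-sum games
        for n in reversed(path):
            n.visits += 1
            n.value_sum += value
            value = -value
    # Act greedily with respect to visit counts at test time
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]
```

In this reading of the abstract, training proceeds exactly as for a plain TD agent (no tree search), and only at test time is each greedy move replaced by `mcts_search(state, trained_agent, num_simulations=...)`, which is what keeps the computational demands of training low.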