In this work, we adapt a training approach inspired by the original AlphaGo system to play the imperfect-information game of Reconnaissance Blind Chess. Using only the agent's observations instead of a full description of the game state, we first train a supervised agent on publicly available game records. Next, we improve the agent's performance through self-play with the on-policy reinforcement learning algorithm Proximal Policy Optimization. To avoid the problems caused by the partial observability of the game state, we do not use any search and rely solely on the policy network to generate moves during play. With this approach, we achieve an Elo rating of 1330 on the RBC leaderboard, which places our agent at position 27 at the time of writing. We find that self-play significantly improves performance and that the agent plays acceptably well without search and without making assumptions about the true game state.
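For reference, the self-play stage optimizes the standard clipped surrogate objective of Proximal Policy Optimization; the formulation below is the generic one from the PPO literature, written with the agent's observation $o_t$ in place of the full state, and the notation is not taken from this paper.

$$
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right],
\qquad
r_t(\theta) = \frac{\pi_\theta(a_t \mid o_t)}{\pi_{\theta_{\text{old}}}(a_t \mid o_t)},
$$

where $\hat{A}_t$ is an advantage estimate and $\epsilon$ is the clip coefficient.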