Reinforcement learning has recently been used to approach well-known NP-hard combinatorial problems in graph theory. Among these problems, Hamiltonian cycle problems are exceptionally difficult to analyze, even when restricted to individual instances of structurally complex graphs. In this paper, we use Monte Carlo Tree Search (MCTS), the search algorithm behind many state-of-the-art reinforcement learning algorithms such as AlphaZero, to create autonomous agents that learn to play the game of Snake, a game centered on properties of Hamiltonian cycles on grid graphs. The game of Snake can be formulated as a single-player discounted Markov Decision Process (MDP) in which the agent must behave optimally in a stochastic environment. Determining the optimal policy for Snake, defined as the policy that, with higher priority, maximizes the probability of winning (the win rate) and, with lower priority, minimizes the expected number of time steps to win, is conjectured to be NP-hard. Performance-wise, compared to prior work on the Snake game, our algorithm is the first to achieve a win rate over $0.5$ (a uniform random policy achieves a win rate $< 2.57 \times 10^{-15}$), demonstrating the versatility of AlphaZero in approaching NP-hard environments.
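To make the lexicographic objective above concrete, one possible formalization is the following sketch, in notation of our own choosing: the symbols $\Pi$, $p_{\mathrm{win}}$, and $T$ are illustrative labels rather than definitions drawn from the paper.
\[
\Pi^{*} \;=\; \operatorname*{arg\,max}_{\pi \in \Pi} \; p_{\mathrm{win}}(\pi),
\qquad
\pi^{*} \;\in\; \operatorname*{arg\,min}_{\pi \in \Pi^{*}} \; \mathbb{E}_{\pi}\!\left[\, T \mid \text{win} \,\right],
\]
where $\Pi$ is the set of all policies, $p_{\mathrm{win}}(\pi)$ is the probability that policy $\pi$ wins, and $T$ is the number of time steps until the game is won. The win rate is optimized first; among the win-rate-optimal policies $\Pi^{*}$, the expected time to win (conditioned on winning) is then minimized.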