The use of a policy and a heuristic function for guiding search can be quite effective in adversarial problems, as demonstrated by AlphaGo and its successors, which are based on the PUCT search algorithm. While PUCT can also be used to solve single-agent deterministic problems, it lacks guarantees on its search effort and can be computationally inefficient in practice. Combining the A* algorithm with a learned heuristic function tends to work better in these domains, but A* and its variants do not use a policy. Moreover, the purpose of using A* is to find solutions of minimum cost, while we seek instead to minimize the search loss (e.g., the number of search steps). LevinTS is guided by a policy and provides guarantees on the number of search steps that relate to the quality of the policy, but it does not make use of a heuristic function. In this work we introduce Policy-guided Heuristic Search (PHS), a novel search algorithm that uses both a heuristic function and a policy and has theoretical guarantees on the search loss that relate to both the quality of the heuristic and of the policy. We show empirically on the sliding-tile puzzle, Sokoban, and a puzzle from the commercial game `The Witness' that PHS enables the rapid learning of both a policy and a heuristic function and compares favorably with A*, Weighted A*, Greedy Best-First Search, LevinTS, and PUCT in terms of number of problems solved and search time in all three domains tested.
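To make the idea concrete, the sketch below shows a best-first search whose node priority combines a path cost g, a heuristic h, and the probability π that a policy assigns to the path, in the spirit of combining LevinTS-style policy guidance with an A*-style heuristic. The priority (g + h)/π used here is an illustrative assumption, not necessarily the paper's exact evaluation function, and the toy problem interface (`neighbors`, `policy`) is hypothetical.

```python
import heapq
import itertools

def policy_guided_search(start, goal, neighbors, h, policy):
    """Best-first search guided by both a heuristic and a policy.

    Illustrative priority: (g(n) + h(n)) / pi(n), where pi(n) is the
    product of the policy's action probabilities along the path to n.
    This is a sketch of the general idea, not the exact PHS evaluation
    function from the paper.
    """
    counter = itertools.count()  # tie-breaker so heap never compares states
    # heap entries: (priority, tiebreak, state, g, pi, path)
    frontier = [(h(start), next(counter), start, 0.0, 1.0, [start])]
    best_g = {start: 0.0}
    while frontier:
        _, _, s, g, pi, path = heapq.heappop(frontier)
        if s == goal:
            return path
        for a, (t, cost) in enumerate(neighbors(s)):
            g2 = g + cost
            if t in best_g and best_g[t] <= g2:
                continue  # already reached t at equal or lower cost
            best_g[t] = g2
            pi2 = pi * policy(s)[a]  # probability of the extended path
            prio = (g2 + h(t)) / max(pi2, 1e-12)  # guard against pi -> 0
            heapq.heappush(
                frontier, (prio, next(counter), t, g2, pi2, path + [t]))
    return None  # goal unreachable
```

A low-probability path is penalized (its priority grows as π shrinks), while a good heuristic keeps the search focused on low-cost completions; this is the sense in which guarantees can relate to the quality of both the policy and the heuristic.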