A tree-based online search algorithm iteratively simulates trajectories and updates the action-values of a set of states stored in a tree structure. It works reasonably well in practice but fails to take advantage of information gathered from similar states. When the action-value function is sufficiently smooth, a simple way to share information among similar states is to perform online learning; policy gradient search provides a practical algorithm for doing so. However, policy gradient search lacks the explicit exploration mechanism that tree-based online search algorithms have. In this paper, we propose an efficient and effective online search algorithm, named Exploratory Policy Gradient Search (ExPoSe), that leverages information sharing among states by directly updating the search-policy parameters while following a well-defined exploration mechanism during the online search. We conduct experiments on several decision-making problems, including Atari games, Sokoban, and Hamiltonian cycle search in sparse graphs, and show that ExPoSe consistently outperforms popular online search algorithms across all domains.
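The combination the abstract describes, online policy-gradient updates of search-policy parameters paired with an explicit exploration mechanism, can be illustrated on a toy problem. The sketch below is not the paper's ExPoSe algorithm (whose details are not given here); it is an assumed minimal instance on a multi-armed bandit, where a softmax policy is updated by a REINFORCE-style gradient while action selection adds a UCB-style visit-count bonus. All function names and parameters are hypothetical.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def exploratory_policy_gradient_search(rewards, steps=1000, lr=0.1, c=0.5, seed=0):
    """Toy sketch (assumed, not the paper's method): update a softmax search
    policy online with a policy-gradient step, while biasing action selection
    with a count-based exploration bonus, UCB-style."""
    rng = np.random.default_rng(seed)
    n = len(rewards)
    theta = np.zeros(n)    # search-policy parameters, updated during search
    counts = np.ones(n)    # visit counts driving the exploration bonus
    for t in range(1, steps + 1):
        # Explicit exploration: bias sampling toward under-visited actions.
        bonus = c * np.sqrt(np.log(t + 1) / counts)
        a = rng.choice(n, p=softmax(theta + bonus))
        counts[a] += 1
        r = rewards[a]
        # Policy-gradient update: grad log pi(a) = onehot(a) - pi.
        pi = softmax(theta)
        grad = -pi
        grad[a] += 1.0
        theta += lr * r * grad
    return softmax(theta)
```

On a two-action problem where only the second action yields reward, the learned policy concentrates on that action while the count bonus keeps the other action occasionally sampled, mirroring the information-sharing-plus-exploration trade-off the abstract motivates.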