Monte-Carlo planning, as exemplified by Monte-Carlo Tree Search (MCTS), has demonstrated remarkable performance in applications with finite spaces. In this paper, we consider Monte-Carlo planning in an environment with continuous state-action spaces, a much less understood problem with important applications in control and robotics. We introduce POLY-HOOT, an algorithm that augments MCTS with a continuous-armed bandit strategy named Hierarchical Optimistic Optimization (HOO) (Bubeck et al., 2011). Specifically, we enhance HOO by using an appropriate polynomial, rather than logarithmic, bonus term in the upper confidence bounds. Such a polynomial bonus is motivated by its empirical successes in AlphaGo Zero (Silver et al., 2017b), as well as its significant role in achieving theoretical guarantees of finite-space MCTS (Shah et al., 2019). We investigate, for the first time, the regret of the enhanced HOO algorithm in non-stationary bandit problems. Using this result as a building block, we establish non-asymptotic convergence guarantees for POLY-HOOT: the value estimate converges to an arbitrarily small neighborhood of the optimal value function at a polynomial rate. We further provide experimental results that corroborate our theoretical findings.
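As a rough illustration (not the paper's implementation), the sketch below contrasts the classical logarithmic UCB bonus with a polynomial bonus of the kind the abstract describes; the exponents `alpha` and `xi`, the constant `c`, and the visit-count arguments are hypothetical placeholders, since the appropriate values are derived in the paper's analysis.

```python
import math

def ucb_log_bonus(total_visits: int, node_visits: int, c: float = 1.0) -> float:
    """Classical UCT-style exploration bonus: c * sqrt(log(n) / n_i)."""
    return c * math.sqrt(math.log(total_visits) / node_visits)

def ucb_poly_bonus(total_visits: int, node_visits: int,
                   c: float = 1.0, alpha: float = 0.5, xi: float = 0.5) -> float:
    """Polynomial exploration bonus of the form c * n^alpha / n_i^xi.

    The exponents alpha and xi here are illustrative placeholders;
    the paper chooses them to satisfy its non-asymptotic analysis.
    """
    return c * total_visits ** alpha / node_visits ** xi

# With the same visit counts, the polynomial bonus decays more slowly in
# the total visit count n, keeping exploration pressure higher for longer.
print(ucb_log_bonus(1000, 10), ucb_poly_bonus(1000, 10))
```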