Policy gradient (PG) is a reinforcement learning (RL) approach that optimizes a parameterized policy model with respect to the expected return by gradient ascent. Given a well-parameterized policy model, such as a neural network, with appropriate initial parameters, PG algorithms work well even when the environment does not have the Markov property. Otherwise, they can become trapped on a plateau or suffer from peakiness effects. As another successful RL approach, algorithms based on Monte-Carlo Tree Search (MCTS), including AlphaZero, have obtained groundbreaking results, especially in the board-game domain. They are also applicable to non-Markov decision processes. However, since standard MCTS lacks the ability to learn a state representation, the tree-search space can be too large to search. In this work, we examine a mixture policy of PG and MCTS so that each compensates for the other's weaknesses while retaining its advantages. We derive conditions for asymptotic convergence using results from two-timescale stochastic approximation and propose an algorithm that satisfies these conditions. The effectiveness of the proposed methods is verified through numerical experiments on non-Markov decision processes.
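The abstract does not specify how the two policies are combined; a minimal sketch of one plausible construction, assuming a convex-combination mixture of a softmax-parameterized policy with an MCTS-like search policy and a REINFORCE-style update with importance weighting, is shown below. The mixture weight `beta`, the tabular parameterization, and the single-step bandit setting are illustrative assumptions, not details taken from the paper.

```python
# Illustrative sketch only (not the paper's algorithm): blend a parameterized
# policy pi_theta with a search-derived policy, sample from the mixture, and
# update theta with an importance-weighted REINFORCE gradient.
import numpy as np

rng = np.random.default_rng(0)
n_actions = 4
theta = np.zeros(n_actions)                   # logits of the parameterized policy
true_means = np.array([0.1, 0.5, 0.2, 0.9])   # hypothetical expected returns

def pi_theta(theta):
    z = np.exp(theta - theta.max())
    return z / z.sum()

def pi_search():
    # Stand-in for an MCTS policy: a softmax over noisy value estimates,
    # mimicking normalized visit counts from a small simulation budget.
    est = true_means + rng.normal(0.0, 0.3, size=n_actions)
    visits = np.exp(5.0 * est)
    return visits / visits.sum()

beta, lr = 0.3, 0.1                           # mixture weight and PG step size
for step in range(2000):
    p_theta = pi_theta(theta)
    p_mix = (1.0 - beta) * p_theta + beta * pi_search()
    a = rng.choice(n_actions, p=p_mix)
    r = rng.normal(true_means[a], 0.1)        # sampled return
    grad_log = -p_theta                        # grad of log pi_theta(a)
    grad_log[a] += 1.0
    is_ratio = p_theta[a] / p_mix[a]           # correct for sampling from the mixture
    theta += lr * is_ratio * r * grad_log

print("learned policy:", np.round(pi_theta(theta), 3))
```

In this sketch the search policy steers exploration toward high-return actions while the gradient step adapts the parameterized policy, which is the kind of complementary behavior the abstract alludes to.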