Reinforcement learning (RL) algorithms have been successfully applied to a range of challenging sequential decision-making and control tasks. In this paper, we classify RL into direct and indirect RL according to how the optimal policy of the Markov decision process (MDP) problem is sought. The former solves for the optimal policy by directly maximizing an objective function, usually the expectation of accumulated future rewards, with gradient-based methods. The latter finds the optimal policy indirectly by solving the Bellman equation, the necessary and sufficient condition given by Bellman's principle of optimality. We study the policy gradient forms of direct and indirect RL and show that both lead to the actor-critic architecture and can be unified into a policy gradient with an approximate value function and the stationary state distribution, revealing the equivalence of direct and indirect RL. We use a Gridworld task to examine the influence of the different policy gradient forms, illustrating their differences and relationships experimentally. Finally, we classify current mainstream RL algorithms under the direct-indirect taxonomy, alongside other taxonomies such as value-based versus policy-based and model-based versus model-free.
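For reference, the two formulations can be written in standard textbook notation; the symbols below, such as the discount factor \(\gamma\), the stationary state distribution \(d^{\pi_\theta}\), and the action-value function \(Q^{\pi_\theta}\), are conventional choices used for illustration rather than the paper's exact notation. Direct RL maximizes

\[ J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t=0}^{\infty} \gamma^t r_t\Big], \qquad \nabla_\theta J(\theta) = \mathbb{E}_{s \sim d^{\pi_\theta},\; a \sim \pi_\theta}\big[\nabla_\theta \log \pi_\theta(a \mid s)\, Q^{\pi_\theta}(s, a)\big], \]

while indirect RL solves the Bellman optimality equation

\[ V^*(s) = \max_{a} \Big\{ r(s, a) + \gamma\, \mathbb{E}_{s' \sim p(\cdot \mid s, a)}\big[V^*(s')\big] \Big\}. \]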
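As a concrete, simplified illustration of the actor-critic architecture that both views lead to, the following minimal sketch runs a tabular actor-critic with a softmax policy on a toy 4x4 gridworld. The grid size, reward shaping, and hyperparameters are assumptions made for this example only and are not taken from the paper's Gridworld experiment.

import numpy as np

# Illustrative tabular actor-critic on a 4x4 gridworld (assumed setup, not the paper's).
N = 4                                            # grid is N x N, states indexed 0..N*N-1
GOAL = N * N - 1                                 # bottom-right corner is the goal state
ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]     # up, down, left, right
GAMMA, ALPHA_V, ALPHA_PI = 0.95, 0.1, 0.1        # assumed hyperparameters

def step(s, a):
    r, c = divmod(s, N)
    dr, dc = ACTIONS[a]
    r, c = min(max(r + dr, 0), N - 1), min(max(c + dc, 0), N - 1)
    s2 = r * N + c
    return s2, (1.0 if s2 == GOAL else -0.01), s2 == GOAL

theta = np.zeros((N * N, len(ACTIONS)))          # policy logits (actor)
V = np.zeros(N * N)                              # state values (critic)

def policy(s):
    z = np.exp(theta[s] - theta[s].max())        # softmax over logits
    return z / z.sum()

rng = np.random.default_rng(0)
for episode in range(2000):
    s, done = 0, False
    while not done:
        p = policy(s)
        a = rng.choice(len(ACTIONS), p=p)
        s2, r, done = step(s, a)
        # TD error from the critic serves as the advantage estimate
        delta = r + (0.0 if done else GAMMA * V[s2]) - V[s]
        V[s] += ALPHA_V * delta                  # critic update
        grad_log = -p.copy()
        grad_log[a] += 1.0                       # gradient of log-softmax w.r.t. logits
        theta[s] += ALPHA_PI * delta * grad_log  # actor update (policy gradient step)
        s = s2

The TD error in this sketch plays the role of the critic's value estimate inside the policy gradient; replacing it with other estimators changes the form of the gradient without changing the overall actor-critic structure.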