In this paper we discuss policy iteration methods for approximate solution of a finite-state discounted Markov decision problem, with a focus on feature-based aggregation methods and their connection with deep reinforcement learning schemes. We introduce features of the states of the original problem, and we formulate a smaller "aggregate" Markov decision problem, whose states relate to the features. The optimal cost function of the aggregate problem, a nonlinear function of the features, serves as an architecture for approximation in value space of the optimal cost function or the cost functions of policies of the original problem. We discuss properties and possible implementations of this type of aggregation, including a new approach to approximate policy iteration. In this approach the policy improvement operation combines feature-based aggregation with reinforcement learning based on deep neural networks, which is used to obtain the needed features. We argue that the cost function of a policy may be approximated much more accurately by the nonlinear function of the features provided by aggregation, than by the linear function of the features provided by deep reinforcement learning, thereby potentially leading to more effective policy improvement.
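To make the aggregation construction concrete, the following is a minimal sketch (not taken from the paper) of hard aggregation for a finite-state discounted MDP, where each aggregate state corresponds to one value of a discrete feature map. The function name hard_aggregation_vi, the uniform disaggregation distribution, and the arguments P, g, alpha, phi, and n_agg are illustrative assumptions, not the paper's notation or implementation.

```python
import numpy as np

def hard_aggregation_vi(P, g, alpha, phi, n_agg, iters=10_000, tol=1e-10):
    """Solve the aggregate problem of hard aggregation by value iteration.

    P     : (n_actions, n_states, n_states) transition probabilities p_xy(u)
    g     : (n_actions, n_states) expected one-stage costs g(x, u)
    alpha : discount factor in (0, 1)
    phi   : (n_states,) integer feature map; phi[x] is the aggregate state of x
    n_agg : number of aggregate states (distinct feature values)

    Returns r_star, the optimal cost of the aggregate problem, and
    J_approx[x] = r_star[phi[x]], the induced approximation over original states.
    """
    n_actions, n_states, _ = P.shape

    # Aggregation probabilities of hard aggregation: membership indicators.
    Phi = np.zeros((n_states, n_agg))
    Phi[np.arange(n_states), phi] = 1.0

    # Disaggregation probabilities: here, uniform over the states in each group
    # (an assumption for this sketch; other choices are possible).
    D = np.zeros((n_agg, n_states))
    for i in range(n_agg):
        members = np.flatnonzero(phi == i)
        D[i, members] = 1.0 / len(members)

    # Value iteration on the aggregate Bellman equation
    #   r(i) = sum_x d_ix * min_u [ g(x,u) + alpha * sum_y p_xy(u) * r(phi(y)) ]
    r = np.zeros(n_agg)
    for _ in range(iters):
        Q = g + alpha * (P @ (Phi @ r))   # Q[u, x]
        r_new = D @ Q.min(axis=0)
        if np.max(np.abs(r_new - r)) < tol:
            r = r_new
            break
        r = r_new

    return r, Phi @ r
```

In the scheme described in the abstract, the feature map would itself be produced by a deep neural network trained on the current policy, and the resulting aggregate optimal cost, a nonlinear function of the features, would then be used for approximation in value space and policy improvement.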