Reinforcement learning (RL) problems over general state and action spaces are notoriously challenging. In contrast to the tabular setting, one cannot enumerate all the states and iteratively update the policy at each state. This precludes the application of many well-studied RL methods, especially those with provable convergence guarantees. In this paper, we first present a substantial generalization of the recently developed policy mirror descent method to handle general state and action spaces. We introduce new approaches to incorporate function approximation into this method, so that explicit policy parameterization is not needed at all. Moreover, we present a novel policy dual averaging method for which possibly simpler function approximation techniques can be applied. We establish linear convergence to global optimality or sublinear convergence to stationarity for these methods when applied to different classes of RL problems under exact policy evaluation. We then define proper notions of the approximation error for policy evaluation and investigate its impact on the convergence of these methods when applied to general-state RL problems with either finite or continuous action spaces. To the best of our knowledge, both these algorithmic frameworks and their convergence analyses appear to be new in the literature.
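For orientation, the following is a minimal sketch of the prototypical policy mirror descent update in the finite-action case; the notation here (stepsize \eta_k, Bregman divergence D, simplex \Delta_{|\mathcal{A}|}) is illustrative and not necessarily that of the paper. At iteration k, given the action-value function Q^{\pi_k} of the current policy \pi_k, the next policy solves, for each state s,
\[
\pi_{k+1}(\cdot \mid s) \;\in\; \operatorname*{argmin}_{p \,\in\, \Delta_{|\mathcal{A}|}} \; \eta_k \,\langle Q^{\pi_k}(s,\cdot),\, p \rangle \;+\; D\big(p,\, \pi_k(\cdot \mid s)\big).
\]
When D is the Kullback-Leibler divergence, this prox-step admits the closed-form multiplicative update \pi_{k+1}(a \mid s) \propto \pi_k(a \mid s)\,\exp\!\big(-\eta_k Q^{\pi_k}(s,a)\big); the generalization studied in this paper is concerned with settings where such state-by-state updates are no longer enumerable.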