Policy optimization methods are popular reinforcement learning algorithms, because their incremental and on-policy nature makes them more stable than their value-based counterparts. However, the same properties also make them slow to converge and sample inefficient, as the on-policy requirement precludes data reuse and the incremental updates couple large iteration complexity into the sample complexity. These characteristics have been observed in experiments as well as in theory in the recent work of~\citet{agarwal2020pc}, which provides a policy optimization method PCPG that can robustly find near-optimal policies for approximately linear Markov decision processes but suffers from an extremely poor sample complexity compared with value-based techniques. In this paper, we propose a new algorithm, COPOE, that overcomes the sample complexity issue of PCPG while retaining its robustness to model misspecification. Compared with PCPG, COPOE makes several important algorithmic enhancements, such as enabling data reuse, and uses more refined analysis techniques, which we expect to be more broadly applicable to designing new reinforcement learning algorithms. The result is an improvement in sample complexity from $\widetilde{O}(1/\epsilon^{11})$ for PCPG to $\widetilde{O}(1/\epsilon^3)$ for COPOE, nearly bridging the gap with value-based techniques.