We propose a new reinforcement learning algorithm derived from a regularized linear-programming formulation of optimal control in MDPs. The method is closely related to the classic Relative Entropy Policy Search (REPS) algorithm of Peters et al. (2010), with the key difference that our method introduces a Q-function that enables an efficient exact model-free implementation. The main feature of our algorithm (called QREPS) is a convex loss function for policy evaluation that serves as a theoretically sound alternative to the widely used squared Bellman error. We provide a practical saddle-point optimization method for minimizing this loss function, together with an error-propagation analysis that relates the quality of the individual updates to the performance of the output policy. Finally, we demonstrate the effectiveness of our method on a range of benchmark problems.
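To make the abstract's two ingredients concrete, the sketch below illustrates a generic saddle-point (primal-dual) scheme for minimizing a convex policy-evaluation loss on a tabular Q-function. It is only a minimal stand-in, not the paper's QREPS objective or algorithm: it aggregates sampled Bellman errors with a log-sum-exp surrogate and exploits the variational identity (1/eta) log(mean_i exp(eta d_i)) = max_{p in simplex} sum_i p_i d_i - KL(p || uniform)/eta, alternating a closed-form dual (softmax-weight) step with a weighted gradient step on Q. All names (eta, lr, the synthetic MDP samples) are assumptions made for illustration.

```python
# Illustrative saddle-point sketch of minimizing a convex, log-sum-exp aggregation
# of sampled Bellman errors over a tabular Q-function. This is NOT the exact QREPS
# loss from the paper; it only mirrors the primal-dual structure described there.
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 2
gamma, eta, lr = 0.9, 1.0, 0.1          # assumed hyperparameters

# Synthetic transition samples (s, a, r, s') from an arbitrary behavior policy.
S = rng.integers(0, n_states, size=200)
A = rng.integers(0, n_actions, size=200)
R = rng.normal(size=200)
S2 = rng.integers(0, n_states, size=200)

Q = np.zeros((n_states, n_actions))

def bellman_errors(Q):
    # Sampled errors d_i = r_i + gamma * max_a' Q(s'_i, a') - Q(s_i, a_i).
    return R + gamma * Q[S2].max(axis=1) - Q[S, A]

for _ in range(500):
    d = bellman_errors(Q)
    # Dual (maximization) step: closed-form softmax weights over samples.
    w = np.exp(eta * (d - d.max()))
    w /= w.sum()
    # Primal (minimization) step: weighted (sub)gradient step on Q.
    grad = np.zeros_like(Q)
    np.add.at(grad, (S, A), -w)              # d d_i / d Q(s_i, a_i) = -1
    best_a = Q[S2].argmax(axis=1)
    np.add.at(grad, (S2, best_a), gamma * w) # d d_i / d Q(s'_i, a*) = gamma
    Q -= lr * grad

# Final value of the convex surrogate loss (lower is better).
print("log-sum-exp Bellman loss:", np.log(np.mean(np.exp(eta * bellman_errors(Q)))) / eta)
```

The closed-form dual update is what makes the saddle-point view attractive here: the inner maximization over sample weights is solved exactly at every iteration, so only the outer minimization over Q requires iterative updates.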