For continuing environments, reinforcement learning methods commonly maximize a discounted reward criterion with a discount factor close to 1 in order to approximate the steady-state reward (the gain). However, such a criterion considers only the long-run performance and ignores the transient behaviour. In this work, we develop a policy gradient method that optimizes the gain, then the bias (which captures the transient performance and is important for selecting among policies with equal gain). We derive expressions that enable sampling-based estimation of the gradient of the bias and of its preconditioning Fisher matrix. We further propose an algorithm that solves the corresponding bi-level optimization using a logarithmic barrier. Experimental results provide insights into the fundamental mechanisms of our proposal.
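As background, the two criteria at stake can be written in standard average-reward MDP notation (a sketch in our own notation, not taken verbatim from the paper; $\pi$ is the policy, $r_t$ the reward at step $t$): the gain of a policy is the long-run average reward,
$$ g^{\pi} \;=\; \lim_{T \to \infty} \frac{1}{T}\, \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{T-1} r_t\right], $$
and the bias from a starting state $s$ is the accumulated transient excess reward,
$$ b^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \bigl(r_t - g^{\pi}\bigr) \,\middle|\, s_0 = s\right], $$
so that two policies with the same gain can still be distinguished by their bias.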