For continuing environments, reinforcement learning (RL) methods commonly maximize the discounted reward criterion with a discount factor close to 1 in order to approximate the average reward (the gain). However, such a criterion considers only the long-run steady-state performance, ignoring the transient behaviour in transient states. In this work, we develop a policy gradient method that optimizes the gain, then the bias (which captures the transient performance and is important for discriminating among policies with equal gain). We derive expressions that enable sample-based estimation of the gradient of the bias and of its preconditioning Fisher matrix. We further devise an algorithm that solves the resulting gain-then-bias (bi-level) optimization. Its key ingredient is an RL-specific logarithmic barrier function. Experimental results provide insights into the fundamental mechanisms of our proposal.
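To make the gain-then-bias objective concrete, the following is a minimal sketch of the bi-level problem and one possible logarithmic-barrier relaxation; the notation $g(\theta)$ for the gain, $b(\theta)$ for the bias, the threshold $g_{\mathrm{lb}}$, and the barrier weight $\beta$ are illustrative assumptions rather than the exact formulation developed in the paper:
\[
\max_{\theta}\; b(\theta) \quad \text{s.t.}\quad \theta \in \arg\max_{\theta'} g(\theta'),
\qquad\text{relaxed to}\qquad
\max_{\theta}\; b(\theta) + \beta \log\!\bigl(g(\theta) - g_{\mathrm{lb}}\bigr),
\]
where $g_{\mathrm{lb}}$ denotes a lower threshold on the acceptable gain, so the barrier term keeps iterates within the (near) gain-optimal region while the bias is being improved.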