We develop theory and algorithms for average-reward on-policy Reinforcement Learning (RL). We first consider bounding the difference of the long-term average reward for two policies. We show that previous work based on the discounted return (Schulman et al., 2015; Achiam et al., 2017) results in a non-meaningful bound in the average-reward setting. By addressing the average-reward criterion directly, we then derive a novel bound which depends on the average divergence between the two policies and on Kemeny's constant. Based on this bound, we develop an iterative procedure which produces a sequence of monotonically improved policies for the average-reward criterion. This iterative procedure can then be combined with classic DRL (Deep Reinforcement Learning) methods, resulting in practical DRL algorithms that target the long-run average reward criterion. In particular, we demonstrate that Average-Reward TRPO (ATRPO), which adapts the on-policy TRPO algorithm to the average-reward criterion, significantly outperforms TRPO in the most challenging MuJoCo environments.
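For reference, the two quantities named above can be recalled via their standard definitions (a minimal sketch, not the paper's specific bound): for an ergodic Markov chain induced by a policy $\pi$ with stationary distribution $d_\pi$, the long-run average reward and Kemeny's constant are
\[
\rho(\pi) \;=\; \lim_{N \to \infty} \frac{1}{N}\, \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{N-1} r(s_t, a_t)\right],
\qquad
\kappa_\pi \;=\; \sum_{s'} d_\pi(s')\, m_\pi(s, s'),
\]
where $m_\pi(s, s')$ denotes the mean first-passage time from $s$ to $s'$ (with $m_\pi(s, s) = 0$); a classical property is that $\kappa_\pi$ does not depend on the starting state $s$.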