Recent Reinforcement Learning (RL) algorithms that use Kullback-Leibler (KL) regularization as a core component have shown outstanding performance. Yet so far, little is understood theoretically about why KL regularization helps. We study KL regularization within an approximate value iteration scheme and show that it implicitly averages q-values. Leveraging this insight, we provide a very strong performance bound, the very first to combine two desirable aspects: a linear dependency on the horizon (instead of quadratic) and an error propagation term involving an averaging effect of the estimation errors (instead of an accumulation effect). We also study the more general case of an additional entropy regularizer. The resulting abstract scheme encompasses many existing RL algorithms. Some of our assumptions do not hold with neural networks, so we complement this theoretical analysis with an extensive empirical study.
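The implicit averaging of q-values can be illustrated numerically. The following is a minimal sketch, not the paper's algorithm or code. It assumes a single state, a uniform initial policy, and a greedy step regularized by a KL term with weight `lam`, whose closed-form maximizer is the new policy proportional to the old policy times exp(q / lam). The names `kl_greedy_step`, `lam`, and the random q-values are hypothetical and chosen only for illustration.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    z = x - np.max(x)
    e = np.exp(z)
    return e / e.sum()


def kl_greedy_step(pi, q, lam):
    """Closed-form maximizer of <p, q> - lam * KL(p || pi) over the simplex.

    The solution is p proportional to pi * exp(q / lam).
    """
    return softmax(np.log(pi) + q / lam)


rng = np.random.default_rng(0)
n_actions, n_iters, lam = 4, 10, 0.5

# Hypothetical sequence of (noisy) q-value estimates for one state.
qs = rng.normal(size=(n_iters, n_actions))

# Apply the KL-regularized greedy step once per q-estimate.
pi = np.full(n_actions, 1.0 / n_actions)
for q in qs:
    pi = kl_greedy_step(pi, q, lam)

# The same policy, computed directly from the sum of all past q-values.
pi_from_sum = softmax(qs.sum(axis=0) / lam)

# Equivalent form: softmax of the mean q-value with temperature lam / n_iters.
pi_from_mean = softmax(qs.mean(axis=0) * n_iters / lam)

assert np.allclose(pi, pi_from_sum)
assert np.allclose(pi, pi_from_mean)
```

In this toy setting the iterated policy depends on the past q-estimates only through their sum, or equivalently their mean with a shrinking temperature. This is one way to read the claim that KL regularization averages q-values rather than relying on the latest estimate alone.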