The recent surge of the entropy-regularized literature shows that Kullback-Leibler (KL) regularization benefits Reinforcement Learning (RL) algorithms by canceling out errors under mild assumptions. However, existing analyses focus on fixed regularization with a constant weighting coefficient and do not consider the case where the coefficient is allowed to change dynamically. In this paper, we study dynamic coefficient schemes and present the first asymptotic error bound for this setting. Guided by this bound, we propose an effective scheme that tunes the coefficient according to the magnitude of the error, favoring more robust learning. Building on this development, we propose a novel algorithm, Geometric Value Iteration (GVI), which features a dynamic, error-aware KL coefficient design aimed at mitigating the impact of errors on performance. Our experiments demonstrate that GVI effectively exploits the trade-off between learning speed and robustness, in contrast to the uniform averaging induced by a constant KL coefficient. The combination of GVI and deep networks exhibits stable learning behavior even in the absence of a target network, where algorithms with a constant KL coefficient oscillate heavily or even fail to converge.
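To make the regularization scheme concrete, the display below sketches one common form of KL-regularized value iteration with an iteration-dependent coefficient. The notation ($q_k$ for the value estimate, $\pi_k$ for the policy, $\lambda_k$ for the KL coefficient, $\epsilon_{k+1}$ for the approximation error) is assumed here for illustration and is not taken from the abstract, nor is it the paper's exact formulation of GVI:
\[
\pi_{k+1} \in \arg\max_{\pi} \ \langle \pi, q_k \rangle - \lambda_k \,\mathrm{KL}\!\left(\pi \,\middle\|\, \pi_k\right),
\qquad
q_{k+1} = r + \gamma P_{\pi_{k+1}} q_k + \epsilon_{k+1}.
\]
With a constant coefficient $\lambda_k \equiv \lambda$, unrolling the greedy step gives a policy proportional to a softmax of the scaled sum of past value estimates, i.e., the uniform averaging referred to above; the dynamic scheme instead lets $\lambda_k$ vary with an estimate of the error magnitude at each iteration.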