This paper presents a new interpretation of the traditional optimization methods in reinforcement learning (RL) as optimization problems with reverse Kullback-Leibler (KL) divergence, and derives a new optimization method that uses forward KL divergence instead. Although RL originally aims to maximize return indirectly through optimization of the policy, recent work by Levine has proposed a different derivation process that explicitly treats optimality as a stochastic variable. This paper follows that concept and formulates the traditional learning rules for both the value function and the policy as optimization problems with reverse KL divergence including optimality. Focusing on the asymmetry of KL divergence, new optimization problems with forward KL divergence are then derived. Remarkably, these new optimization problems can be regarded as optimistic RL, where the degree of optimism is intuitively specified by a hyperparameter converted from an uncertainty parameter. This optimism can be further enhanced by integrating the method with prioritized experience replay and eligibility traces, both of which accelerate learning. The effects of this expected optimism were investigated through learning tendencies in numerical simulations using Pybullet. As a result, moderate optimism accelerated learning and yielded higher rewards; in a realistic robotic simulation, the proposed method with moderate optimism outperformed one of the state-of-the-art RL methods.
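The asymmetry the abstract refers to is the fact that KL divergence is not symmetric in its arguments: KL(p‖q) ≠ KL(q‖p) in general, which is why swapping the direction yields a genuinely different optimization problem. A minimal sketch (the distributions below are illustrative, not from the paper):

```python
import math

def kl(p, q):
    # KL(p || q) = sum_i p_i * log(p_i / q_i) for discrete distributions;
    # terms with p_i = 0 contribute nothing by convention.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Two example discrete distributions over three outcomes.
p = [0.7, 0.2, 0.1]
q = [0.4, 0.4, 0.2]

forward = kl(p, q)  # forward KL: heavily penalizes q for missing mass where p is large
reverse = kl(q, p)  # reverse KL: heavily penalizes q for placing mass where p is small

print(forward, reverse)  # the two values differ, demonstrating the asymmetry
```

In the RL setting of the abstract, minimizing the reverse KL tends to produce mode-seeking (conservative) behavior, while the forward KL is mass-covering, which is one intuition behind reading the forward-KL formulation as "optimistic".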