This paper presents a new interpretation of reinforcement learning (RL) as reverse Kullback-Leibler (KL) divergence optimization and derives a new optimization method using forward KL divergence. Although RL originally aims to maximize return indirectly through optimization of the policy, recent work by Levine proposed a different derivation that explicitly treats optimality as a stochastic variable. This paper follows that concept and formulates the traditional learning rules for both the value function and the policy as optimization problems of reverse KL divergence including optimality. Focusing on the asymmetry of KL divergence, new optimization problems with forward KL divergence are then derived. Remarkably, these new optimization problems can be regarded as optimistic RL, where the degree of optimism is intuitively specified by a hyperparameter converted from an uncertainty parameter. In addition, the optimism can be enhanced when the method is integrated with prioritized experience replay and eligibility traces, both of which accelerate learning. The effects of this expected optimism were investigated through learning tendencies in numerical simulations using PyBullet. As a result, moderate optimism accelerated learning and yielded higher rewards. In a realistic robotic simulation, the proposed method with moderate optimism outperformed one of the state-of-the-art RL methods.
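For readers unfamiliar with the control-as-inference framing referred to above, the asymmetry can be sketched as follows; this is an illustrative formulation assuming the standard optimality posterior from Levine's framework, and the notation is not taken from the paper itself. With a binary optimality variable $O$ whose likelihood is $p(O = 1 \mid s, a) \propto \exp(r(s, a))$, the two directions of the KL divergence give

\[
\text{reverse KL:}\ \ \min_{\pi}\, D_{\mathrm{KL}}\!\big(\pi(a \mid s)\,\|\,p(a \mid s, O = 1)\big),
\qquad
\text{forward KL:}\ \ \min_{\pi}\, D_{\mathrm{KL}}\!\big(p(a \mid s, O = 1)\,\|\,\pi(a \mid s)\big).
\]

The reverse direction is mode-seeking, while the forward direction is mass-covering, which is the qualitative source of the optimistic behavior discussed in the abstract.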