The recent remarkable progress of deep reinforcement learning (DRL) rests on policy regularization for stable and efficient learning. A popular method, proximal policy optimization (PPO), has been introduced for this purpose. PPO clips the density ratio between the latest and baseline policies at a given threshold, yet its minimization target remains unclear. As another problem of PPO, the threshold is specified symmetrically in numerical terms even though the density ratio itself lies in an asymmetric domain, thereby causing unbalanced regularization of the policy. This paper therefore proposes a new variant of PPO, named PPO-RPE, derived from a regularization problem of the relative Pearson (RPE) divergence. This regularization yields a clear minimization target that constrains the latest policy toward the baseline one. Its analysis leads to an intuitive threshold-based design consistent with the asymmetry of the threshold and the domain of the density ratio. Across four benchmark tasks, PPO-RPE performed as well as or better than conventional methods in terms of the task performance of the learned policy.
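For reference, the two quantities mentioned above can be sketched as follows: the clipped density-ratio objective of standard PPO and the relative Pearson (RPE) divergence in its commonly used form. These are standard textbook definitions, not the exact regularized objective of PPO-RPE, which is derived in the body of the paper; the notation ($\pi_\theta$, $\pi_b$, $A_t$, $\epsilon$, $\alpha$) is chosen here for illustration and may differ from the paper's.

% Clipped surrogate objective of PPO, with density ratio r_t between the
% latest policy \pi_\theta and the baseline policy \pi_b:
\[
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_b(a_t \mid s_t)},
\qquad
L^{\mathrm{CLIP}}(\theta) = \mathbb{E}_t\!\left[ \min\bigl( r_t(\theta)\, A_t,\ \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\, A_t \bigr) \right],
\]

% Relative Pearson (RPE) divergence between the latest and baseline policies,
% defined through the mixture q_\alpha with mixture ratio \alpha \in [0, 1):
\[
q_\alpha = \alpha\, \pi_\theta + (1-\alpha)\, \pi_b,
\qquad
\mathrm{PE}_\alpha(\pi_\theta \,\|\, \pi_b) = \tfrac{1}{2}\, \mathbb{E}_{q_\alpha}\!\left[ \left( \frac{\pi_\theta}{q_\alpha} - 1 \right)^{2} \right],
\]

where $A_t$ is the advantage and $\epsilon$ the clipping threshold. The mixture ratio $\alpha$ bounds the relative density ratio $\pi_\theta / q_\alpha$, which is why a threshold-based design can be made consistent with the asymmetric domain of the density ratio.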