The policy gradient method enjoys the simplicity of its objective: the agent optimizes the cumulative reward directly. Moreover, in the continuous action domain, a parameterized action distribution allows easy control of exploration through the variance of the representing distribution. Entropy can play an essential role in policy optimization by encouraging a stochastic policy, which in turn helps the agent explore the environment better in reinforcement learning (RL). However, the stochasticity often decreases as training progresses, and thus the policy becomes less exploratory. Additionally, a particular parametric distribution might only work for some environments and may require extensive hyperparameter tuning. This paper aims to mitigate these issues. In particular, we propose an algorithm called Robust Policy Optimization (RPO), which leverages a perturbed action distribution. We hypothesize that our method encourages high-entropy actions and provides a better representation of the action space. We further provide empirical evidence to verify this hypothesis. We evaluate our method on various continuous control tasks from DeepMind Control, OpenAI Gym, Pybullet, and IsaacGym. We observe that in many settings, RPO increases the policy entropy early in training and then maintains a certain level of entropy throughout the training period. Eventually, our RPO agent shows consistently improved performance compared to PPO and other baseline techniques, including entropy regularization, alternative action distributions, and data augmentation. Furthermore, in several settings, our method remains robust in performance, while other baseline mechanisms fail to improve and can even worsen performance.
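To make the "perturbed distribution" idea concrete, the following is a minimal sketch, assuming the perturbation takes the form of a uniform random shift applied to the mean of a Gaussian action policy; the function name `perturbed_action_distribution`, the hyperparameter `alpha`, and this exact form are illustrative assumptions rather than the paper's definitive implementation.

```python
# Minimal sketch: a Gaussian action distribution whose mean is randomly
# perturbed, intended to keep the policy's entropy from collapsing during
# training. The uniform-shift form and `alpha` are assumptions for illustration.
import torch
from torch.distributions import Normal


def perturbed_action_distribution(mean: torch.Tensor,
                                  std: torch.Tensor,
                                  alpha: float = 0.5) -> Normal:
    """Return a Gaussian policy whose mean is shifted by U(-alpha, alpha) noise."""
    noise = torch.empty_like(mean).uniform_(-alpha, alpha)  # per-dimension uniform shift
    return Normal(mean + noise, std)


# Usage: sample an action and its log-probability as one would in a PPO-style update.
mean = torch.zeros(6)        # e.g., actor-network output for a 6-D continuous action space
std = torch.ones(6) * 0.3    # e.g., learned (state-independent) standard deviation
dist = perturbed_action_distribution(mean, std)
action = dist.sample()
log_prob = dist.log_prob(action).sum(-1)
```

Because the shift is resampled at every distribution construction, the effective action distribution stays broader than the unperturbed Gaussian, which is one way to read the entropy-maintenance behavior described in the abstract.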