Proximal Policy Optimization (PPO) is a highly popular policy-based deep reinforcement learning (DRL) approach. However, we observe that the homogeneous exploration process in PPO can cause an unexpected stability issue during training. To address this issue, we propose PPO-UE, a PPO variant equipped with self-adaptive, uncertainty-aware exploration (UE) based on a ratio uncertainty level. The proposed PPO-UE is designed to improve convergence speed and performance at an optimized ratio uncertainty level. Extensive sensitivity analysis over varying ratio uncertainty levels shows that PPO-UE considerably outperforms the baseline PPO in Roboschool continuous control tasks.
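To make the notion of a "ratio uncertainty level" concrete, the sketch below shows the standard PPO probability ratio r_t(theta) = pi_theta(a_t|s_t) / pi_theta_old(a_t|s_t) together with one possible uncertainty-triggered exploration rule. The threshold `psi`, the measure |r_t - 1|, and the noise-inflation helper `explore_action` are illustrative assumptions, not the paper's exact formulation.

```python
import torch

def ppo_ratio(new_log_prob: torch.Tensor, old_log_prob: torch.Tensor) -> torch.Tensor:
    """Standard PPO probability ratio r_t(theta) = pi_theta / pi_theta_old,
    computed in log space for numerical stability."""
    return torch.exp(new_log_prob - old_log_prob)

def needs_extra_exploration(ratio: torch.Tensor, psi: float = 0.05) -> torch.Tensor:
    """Assumed rule: flag samples whose ratio deviation |r_t - 1| falls below psi,
    i.e. the policy update is nearly homogeneous and may benefit from more exploration."""
    return (ratio - 1.0).abs() < psi

def explore_action(mean: torch.Tensor, std: torch.Tensor, extra_scale: float = 1.5) -> torch.Tensor:
    """Assumed mechanism: sample an action with an inflated standard deviation
    to encourage exploration on flagged states."""
    return torch.normal(mean, std * extra_scale)
```

Such a check could be applied per minibatch during the PPO update, switching between the standard policy sample and the inflated-noise sample depending on the measured ratio uncertainty; the exact switching scheme used by PPO-UE is described in the method section.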