Proximal Policy Optimization (PPO) is among the most widely used algorithms in reinforcement learning, achieving state-of-the-art performance on many challenging problems. The keys to its success are the reliable policy updates enabled by the clipping mechanism and the multiple epochs of minibatch updates. The aim of this research is to provide simple yet effective alternatives to the former. To this end, we propose clipping ranges that decay linearly or exponentially over the course of training. These schedules are intended to allow greater exploration at the beginning of the learning phase and to impose stronger restrictions toward its end. We evaluate their performance in several classical control and robotic locomotion environments. Our analysis shows that they influence the achieved rewards and are effective alternatives to the constant clipping range in many reinforcement learning tasks.
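As an illustration, a minimal sketch of the two decaying schedules could look as follows. The function names, parameter names, and default values (eps_start, eps_end, total_steps, decay_rate) are assumptions for illustration only and are not taken from the paper.

```python
import math

# Hypothetical schedules for the PPO clipping range epsilon.
# The resulting value would be used in the clipped surrogate objective,
# i.e. the probability ratio is clipped to [1 - eps, 1 + eps].

def linear_clip(step, total_steps, eps_start=0.3, eps_end=0.1):
    """Linearly anneal the clipping range from eps_start to eps_end."""
    frac = min(step / total_steps, 1.0)
    return eps_start + frac * (eps_end - eps_start)

def exponential_clip(step, eps_start=0.3, decay_rate=1e-5):
    """Exponentially decay the clipping range starting from eps_start."""
    return eps_start * math.exp(-decay_rate * step)
```

In such a setup, the scheduled epsilon would simply replace the constant clipping range at each policy update, so early updates are less constrained and later updates are clipped more aggressively.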