PPO (Proximal Policy Optimization) algorithm has demonstrated excellent performance in many fields, and it is considered as a simple version of TRPO (Trust Region Policy Optimization) algorithm. However, the ratio clipping operation in PPO may not always effectively enforce the trust region constraints, this can be a potential factor affecting the stability of the algorithm. In this paper, we propose SPO (Simple Policy Optimization) algorithm, which introduces a novel clipping method for KL divergence between the old and current policies. SPO can effectively enforce the trust region constraints in almost all environments, while still maintaining the simplicity of a first-order algorithm. Comparative experiments in Atari 2600 environments show that SPO sometimes provides stronger performance than PPO. Code is available at https://github.com/MyRepositories-hub/Simple-Policy-Optimization.
翻译:暂无翻译