As an algorithm based on deep reinforcement learning, Proximal Policy Optimization (PPO) performs well in many complex tasks and has become one of the most popular RL algorithms in recent years. According to the penalty mechanism in the surrogate objective, PPO can be divided into PPO with KL divergence (KL-PPO) and PPO with a clip function (Clip-PPO). Clip-PPO is widely used in a variety of practical scenarios and has attracted the attention of many researchers; as a result, many variants have been proposed that continue to improve the algorithm. However, KL-PPO, although more theoretically grounded, has been neglected because its performance is not as good as that of Clip-PPO. In this article, we analyze the asymmetry effect of KL divergence on PPO's objective function and give an inequality that indicates when this asymmetry will affect the efficiency of KL-PPO. We then propose PPO with a Correntropy Induced Metric (CIM-PPO), which applies the theory of correntropy (a symmetric metric widely used in M-estimation to evaluate the difference between two distributions) to PPO. Finally, we design experiments based on OpenAI Gym to test the effectiveness of the new algorithm and compare it with KL-PPO and Clip-PPO.
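For context, the penalty mechanisms that distinguish the two PPO variants, and the correntropy induced metric, can be sketched in standard notation (this is a hedged sketch using conventional symbols, not notation reproduced from the paper body): $r_t(\theta)=\pi_\theta(a_t\mid s_t)/\pi_{\theta_{\mathrm{old}}}(a_t\mid s_t)$ is the probability ratio, $\hat{A}_t$ the advantage estimate, $\beta$ the KL penalty coefficient, $\epsilon$ the clip range, and $\kappa_\sigma$ a Gaussian kernel with bandwidth $\sigma$.

\[
L^{\mathrm{KL}}(\theta) = \hat{\mathbb{E}}_t\!\left[ r_t(\theta)\,\hat{A}_t - \beta\,\mathrm{KL}\!\left[\pi_{\theta_{\mathrm{old}}}(\cdot\mid s_t)\,\|\,\pi_\theta(\cdot\mid s_t)\right] \right],
\]
\[
L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[ \min\!\big( r_t(\theta)\,\hat{A}_t,\ \mathrm{clip}\!\left(r_t(\theta),\,1-\epsilon,\,1+\epsilon\right)\hat{A}_t \big) \right].
\]

The KL divergence is asymmetric, $\mathrm{KL}(p\,\|\,q)\neq\mathrm{KL}(q\,\|\,p)$ in general, whereas the correntropy induced metric,

\[
\mathrm{CIM}(X,Y) = \big( \kappa_\sigma(0) - \mathbb{E}\!\left[\kappa_\sigma(X-Y)\right] \big)^{1/2},
\qquad \kappa_\sigma(x) = \exp\!\left(-\tfrac{x^2}{2\sigma^2}\right),
\]

is symmetric in its two arguments, which motivates replacing the KL penalty with a CIM-based one. The exact CIM-based surrogate objective used by CIM-PPO is defined in the body of the paper and is not reproduced here.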