It is challenging for reinforcement learning (RL) algorithms to succeed in real-world applications such as financial trading and logistics systems, due to noisy observations and the environment shift between training and evaluation. Resolving such real-world tasks therefore requires both high sample efficiency and strong generalization. However, directly applying typical RL algorithms can lead to poor performance in these scenarios. Motivated by the strong accuracy and generalization of ensemble methods in supervised learning (SL), we design a robust and applicable method named Ensemble Proximal Policy Optimization (EPPO), which learns ensemble policies in an end-to-end manner. Notably, EPPO organically combines each individual policy with the policy ensemble and optimizes both simultaneously. In addition, EPPO adopts a diversity-enhancement regularization over the policy space, which helps the learned policies generalize to unseen states and promotes exploration. We theoretically prove that EPPO increases exploration efficacy, and through comprehensive experimental evaluations on various tasks, we demonstrate that EPPO achieves higher efficiency and is more robust in real-world applications than vanilla policy optimization algorithms and other ensemble methods. Code and supplemental materials are available at https://seqml.github.io/eppo.
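To make the two ingredients described above concrete, the following is a minimal sketch, not the paper's exact formulation: it assumes an ensemble of K categorical policy heads over a shared backbone, a PPO-style clipped surrogate applied to both each sub-policy and their (assumed uniform) mixture, and a diversity term implemented as average pairwise KL divergence between sub-policies. The network sizes, loss weights, and the use of a single set of behavior-policy log-probabilities are illustrative assumptions.

```python
# Sketch of an ensemble policy with joint per-policy / ensemble PPO losses
# and a pairwise-KL diversity bonus (illustrative assumptions throughout).
import torch
import torch.nn as nn
from torch.distributions import Categorical, kl_divergence


class EnsemblePolicy(nn.Module):
    def __init__(self, obs_dim: int, act_dim: int, num_heads: int = 4):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(obs_dim, 64), nn.Tanh())
        # K independent policy heads sharing one feature extractor (an assumption).
        self.heads = nn.ModuleList([nn.Linear(64, act_dim) for _ in range(num_heads)])

    def forward(self, obs: torch.Tensor):
        feat = self.backbone(obs)
        dists = [Categorical(logits=h(feat)) for h in self.heads]
        # Ensemble policy: uniform mixture of the sub-policies' action probabilities.
        mix_probs = torch.stack([d.probs for d in dists]).mean(dim=0)
        return dists, Categorical(probs=mix_probs)


def ppo_clip_loss(dist, actions, old_log_probs, advantages, clip_eps=0.2):
    # Standard PPO clipped surrogate objective (negated for minimization).
    ratio = torch.exp(dist.log_prob(actions) - old_log_probs)
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps)
    return -torch.min(ratio * advantages, clipped * advantages).mean()


def eppo_loss(policy, obs, actions, old_log_probs, advantages, div_coef=0.01):
    dists, ensemble = policy(obs)
    # Optimize the ensemble policy and each sub-policy simultaneously;
    # old_log_probs are assumed to come from the behavior (ensemble) policy.
    loss = ppo_clip_loss(ensemble, actions, old_log_probs, advantages)
    loss = loss + sum(ppo_clip_loss(d, actions, old_log_probs, advantages)
                      for d in dists) / len(dists)
    # Diversity regularization: reward disagreement among sub-policies by
    # maximizing their average pairwise KL divergence.
    div = torch.stack([kl_divergence(di, dj).mean()
                       for i, di in enumerate(dists)
                       for j, dj in enumerate(dists) if i != j]).mean()
    return loss - div_coef * div
```

Under this reading, actions would be sampled from the ensemble (mixture) policy during rollouts, while the per-head surrogate losses keep every sub-policy trained and the diversity term keeps the heads from collapsing onto one another.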