Deep Reinforcement Learning (Deep RL) and Evolutionary Algorithms (EA) are two major paradigms of policy optimization with distinct learning principles, i.e., gradient-based vs. gradient-free. An appealing research direction is to integrate Deep RL and EA, devising new methods that fuse their complementary advantages. However, existing works on combining Deep RL and EA have two common drawbacks: 1) the RL agent and EA agents learn their policies individually, neglecting efficient sharing of useful common knowledge; 2) parameter-level policy optimization provides no guarantee of semantic-level behavior evolution on the EA side. In this paper, we propose Evolutionary Reinforcement Learning with Two-scale State Representation and Policy Representation (ERL-Re2), a novel solution to these two drawbacks. The key idea of ERL-Re2 is two-scale representation: all EA and RL policies share the same nonlinear state representation while maintaining individual linear policy representations. The state representation conveys expressive common features of the environment learned by all the agents collectively; the linear policy representation provides a favorable space for efficient policy optimization, where novel behavior-level crossover and mutation operations can be performed. Moreover, the linear policy representation allows convenient generalization of policy fitness with the help of a Policy-extended Value Function Approximator (PeVFA), further improving the sample efficiency of fitness estimation. Experiments on a range of continuous control tasks show that ERL-Re2 consistently outperforms strong baselines and achieves significant improvements over both its Deep RL and EA components.
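The two-scale idea can be illustrated with a minimal sketch. It is not the authors' implementation: the layer sizes, the single-layer `tanh` encoder standing in for the shared nonlinear state representation, the per-action-dimension layout of the linear policy, and the helper names (`state_representation`, `new_linear_policy`, `act`, `crossover`, `mutate`) are all assumptions made for illustration. Crossover and mutation here operate on whole action dimensions, which is one plausible reading of "behavior-level" operators.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
STATE_DIM, FEAT_DIM, ACTION_DIM = 8, 16, 3

# Shared nonlinear state representation, used by every policy.
# A single tanh layer stands in for a deeper shared network.
shared_W = rng.normal(scale=0.1, size=(FEAT_DIM, STATE_DIM))
shared_b = np.zeros(FEAT_DIM)


def state_representation(state):
    """Map a raw state to the shared feature vector."""
    return np.tanh(shared_W @ state + shared_b)


def new_linear_policy():
    """Individual linear policy: row k holds the weights (and a bias in the
    last column) that produce action dimension k from the shared features."""
    return rng.normal(scale=0.1, size=(ACTION_DIM, FEAT_DIM + 1))


def act(policy, state):
    """Compute an action in [-1, 1] from shared features and a linear policy."""
    features = np.append(state_representation(state), 1.0)  # constant 1 for bias
    return np.tanh(policy @ features)


def crossover(parent_a, parent_b, swap_prob=0.5):
    """Assumed behavior-level crossover: swap entire action dimensions
    (rows) between two parents instead of individual parameters."""
    mask = rng.random(ACTION_DIM) < swap_prob
    child_a, child_b = parent_a.copy(), parent_b.copy()
    child_a[mask], child_b[mask] = parent_b[mask], parent_a[mask]
    return child_a, child_b


def mutate(policy, dim_prob=0.5, noise_std=0.05):
    """Assumed behavior-level mutation: add Gaussian noise only to the
    rows of randomly selected action dimensions."""
    mask = rng.random(ACTION_DIM) < dim_prob
    child = policy.copy()
    child[mask] += rng.normal(scale=noise_std, size=(int(mask.sum()), FEAT_DIM + 1))
    return child


# Usage: a small population sharing one state representation.
population = [new_linear_policy() for _ in range(4)]
state = rng.normal(size=STATE_DIM)
actions = [act(policy, state) for policy in population]
child_a, child_b = crossover(population[0], population[1])
mutant = mutate(population[2])
```

Because every individual differs only in its small linear matrix, the evolutionary operators act on a low-dimensional space while the shared encoder carries the features common to all agents. In the full method the shared representation would be trained collectively, and fitness could be estimated with a PeVFA conditioned on the linear policy; neither is shown here.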