Behavior-based Neuroevolutionary Training in Reinforcement Learning

In addition to their undisputed success in solving classical optimization problems, neuroevolutionary and population-based algorithms have become an alternative to standard reinforcement learning methods. However, evolutionary methods often lack the sample efficiency of standard value-based methods, which leverage gathered state and value experience. Sample efficiency is essential when reinforcement learning is applied to real-world problems with significant resource costs. Enhancing evolutionary algorithms with experience-exploiting methods is therefore desirable and promises valuable insights. This work presents a hybrid algorithm that combines topology-changing neuroevolutionary optimization with value-based reinforcement learning. We illustrate how the behavior of policies can be used to create distance and loss functions that benefit from stored experiences and calculated state values. These functions allow us to model behavior and perform a directed search in the behavior space using gradient-free evolutionary algorithms and surrogate-based optimization. For this purpose, we consolidate different methods to generate and optimize agent policies, creating a diverse population. We demonstrate the performance of our algorithm on standard benchmarks and a purpose-built real-world problem. Our results indicate that combining these methods can enhance the sample efficiency and learning speed of evolutionary approaches.

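The abstract does not spell out how the behavior-based distance is defined. As a rough illustration only, one way such a distance could be built is to compare the actions two policies produce on the same stored states, optionally weighting each state by an estimated value. The names `behavior_distance`, `make_policy`, and `value_estimates`, the linear `tanh` policies, and the Euclidean metric are all assumptions made for this sketch and are not taken from the paper.

```python
import numpy as np


def behavior_distance(policy_a, policy_b, states, weights=None):
    """Mean Euclidean distance between two policies' actions on shared states.

    Hypothetical helper: `weights` (e.g. state-value estimates) lets some
    stored states count more than others; None means a plain average.
    """
    actions_a = np.asarray([policy_a(s) for s in states], dtype=float)
    actions_b = np.asarray([policy_b(s) for s in states], dtype=float)
    # Flatten each action to a vector so scalar actions also work.
    actions_a = actions_a.reshape(len(states), -1)
    actions_b = actions_b.reshape(len(states), -1)
    per_state = np.linalg.norm(actions_a - actions_b, axis=1)
    if weights is None:
        return float(per_state.mean())
    weights = np.asarray(weights, dtype=float)
    return float(np.sum(weights * per_state) / np.sum(weights))


def make_policy(weight_matrix):
    """Toy deterministic policy: a single linear layer followed by tanh."""
    return lambda state: np.tanh(weight_matrix @ state)


rng = np.random.default_rng(0)

# Stand-in for states sampled from stored experience (64 states, 4 features).
states = rng.normal(size=(64, 4))
# Stand-in for state values computed by a critic; assumed non-negative here.
value_estimates = rng.uniform(0.1, 1.0, size=64)

policy_a = make_policy(rng.normal(size=(2, 4)))
policy_b = make_policy(rng.normal(size=(2, 4)))

plain = behavior_distance(policy_a, policy_b, states)
weighted = behavior_distance(policy_a, policy_b, states, value_estimates)
print(f"unweighted distance: {plain:.4f}")
print(f"value-weighted distance: {weighted:.4f}")
```

Because the distance depends only on policy outputs, it can compare networks with different topologies without running new environment episodes. That property is one reason a behavior-space measure could pair naturally with topology-changing neuroevolution.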