Advances in Reinforcement Learning (RL) have demonstrated data efficiency and optimal control over large state spaces, at the cost of scalable performance. Genetic methods, on the other hand, provide scalability but are sensitive to the hyperparameters of their evolutionary operations. However, a combination of the two methods has recently demonstrated success in scaling RL agents to high-dimensional action spaces. In line with these developments, we present the Evolution-based Soft Actor-Critic (ESAC), a scalable RL algorithm. We decouple exploration from exploitation by combining Evolution Strategies (ES) with Soft Actor-Critic (SAC). Through this lens, we enable dominant skill transfer between offspring by making use of soft winner selection and genetic crossovers in hindsight, and simultaneously reduce the hyperparameter sensitivity of the evolution using the novel Automatic Mutation Tuning (AMT). AMT gradually replaces the entropy framework of SAC, allowing the population to succeed at the task while acting as randomly as possible, without relying on backpropagation updates. In a study of challenging locomotion tasks with high-dimensional action spaces and sparse rewards, ESAC demonstrates improved performance and sample efficiency in comparison to the Maximum Entropy framework. Additionally, ESAC makes efficient use of hardware resources and incurs low algorithmic overhead. A complete implementation of ESAC can be found at karush17.github.io/esac-web/.
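To make the structure described above concrete, the following is a minimal NumPy-only sketch of an ES loop with soft winner selection, hindsight crossover, and an AMT-style adaptive mutation scale. It is not the authors' reference implementation: the SAC gradient step is stubbed out, the fitness function `rollout_return`, the population size, and the multiplicative AMT update rule are illustrative assumptions.

```python
# Illustrative sketch of the ES side of an ESAC-style loop (assumptions noted inline).
import numpy as np

rng = np.random.default_rng(0)
DIM = 8                      # number of policy parameters (toy linear policy)
POP_SIZE = 16                # ES population size (assumed value)
ELITE_FRAC = 0.25            # fraction of offspring treated as soft winners

def rollout_return(theta):
    """Stand-in fitness: episodic return of a policy parameterised by theta."""
    target = np.linspace(-1.0, 1.0, DIM)          # hypothetical optimum
    return -np.sum((theta - target) ** 2)         # higher is better

def soft_winner_indices(fitness, k):
    """Sample k winners with probability softmax(fitness) (soft selection)."""
    probs = np.exp(fitness - fitness.max())
    probs /= probs.sum()
    return rng.choice(len(fitness), size=k, replace=False, p=probs)

def crossover(parent_a, parent_b):
    """Uniform crossover: each parameter inherited from a random parent."""
    mask = rng.random(DIM) < 0.5
    return np.where(mask, parent_a, parent_b)

theta = np.zeros(DIM)        # shared actor parameters (SAC update stubbed below)
sigma = 0.5                  # mutation scale, adapted by the AMT-style rule
prev_best = -np.inf

for generation in range(50):
    # 1. Mutate: perturb the current actor to form an offspring population.
    population = theta + sigma * rng.standard_normal((POP_SIZE, DIM))
    fitness = np.array([rollout_return(p) for p in population])

    # 2. Soft winner selection and hindsight crossover of dominant offspring.
    k = max(2, int(ELITE_FRAC * POP_SIZE))
    winners = population[soft_winner_indices(fitness, k)]
    child = crossover(winners[0], winners[1])

    # 3. (Stub) In ESAC a SAC-style gradient update on `theta` would occur here;
    #    this placeholder simply blends the actor with the crossover child.
    theta = 0.5 * (theta + child)

    # 4. AMT-style tuning (assumed form): widen mutations while the best return
    #    improves, shrink them when it regresses, with no backpropagation involved.
    best = fitness.max()
    sigma *= 1.05 if best > prev_best else 0.9
    prev_best = max(prev_best, best)

print(f"final return: {rollout_return(theta):.3f}, sigma: {sigma:.3f}")
```

The sketch only conveys the division of labour: gradient-free population updates handle exploration, while the (stubbed) SAC update would handle exploitation, and the mutation scale is tuned automatically rather than hand-set.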