In this paper we analyze the qualitative differences between evolutionary strategies and reinforcement learning algorithms by focusing on two popular state-of-the-art representatives: the OpenAI-ES evolutionary strategy and the Proximal Policy Optimization (PPO) reinforcement learning algorithm -- the most similar methods of the two families. We analyze how the methods differ with respect to: (i) general efficacy, (ii) ability to cope with sparse rewards, (iii) propensity/capacity to discover minimal solutions, (iv) dependency on reward shaping, and (v) ability to cope with variations of the environmental conditions. The analysis of the performance and of the behavioral strategies displayed by the agents trained with the two methods on benchmark problems enables us to demonstrate qualitative differences that were not identified in previous studies, to identify the relative weaknesses of the two methods, and to propose ways to ameliorate some of those weaknesses. We show that the characteristics of the reward function have a strong impact that varies qualitatively not only between OpenAI-ES and PPO but also among alternative reinforcement learning algorithms, thus demonstrating the importance of tailoring the characteristics of the reward function to the algorithm used.