Deep Reinforcement Learning (RL) has emerged as a powerful paradigm for training neural policies to solve complex control tasks. However, these policies tend to overfit to the exact specifications of the task and environment they were trained on, and thus perform poorly when conditions deviate slightly or when the policies are composed hierarchically to solve even more complex tasks. Recent work has shown that training a mixture of policies, as opposed to a single one, whose members are driven to explore different regions of the state-action space can address this shortcoming by generating a diverse set of behaviors, referred to as skills, that can be used collectively to great effect in adaptation tasks or for hierarchical planning. This is typically realized by including a diversity term, often derived from information theory, in the objective function optimized by RL. However, these approaches often require careful hyperparameter tuning to be effective. In this work, we demonstrate that less widely used neuroevolution methods, specifically Quality Diversity (QD), are a competitive alternative to information-theory-augmented RL for skill discovery. We conduct an extensive empirical evaluation comparing eight state-of-the-art algorithms (four flagship algorithms from each line of work) on the basis of (i) metrics directly evaluating the skills' diversity, (ii) the skills' performance on adaptation tasks, and (iii) the skills' performance when used as primitives for hierarchical planning. QD methods are found to provide equal, and sometimes improved, performance whilst being less sensitive to hyperparameters and more scalable. As no single method provides near-optimal performance across all environments, there is a rich scope for further research, which we support by proposing future directions and providing optimized open-source implementations.
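To make the "diversity term derived from information theory" concrete, a minimal sketch of one widely used instantiation, the mutual-information objective popularized by DIAYN (Eysenbach et al., 2019), is given below in LaTeX notation; this is an illustrative example and not necessarily one of the specific objectives evaluated in this work. Here $Z$ is a latent skill variable, $S$ the state, $A$ the action, and $q_\phi$ a learned skill discriminator:

% Illustrative information-theoretic skill-discovery objective (DIAYN-style):
\begin{align}
  \mathcal{F}(\theta) &= \mathcal{I}(S; Z) + \mathcal{H}[A \mid S] - \mathcal{I}(A; Z \mid S) \\
                      &= \mathcal{H}[Z] - \mathcal{H}[Z \mid S] + \mathcal{H}[A \mid S, Z],
\end{align}
% which, via a variational lower bound with the discriminator q_phi(z|s),
% yields the per-step pseudo-reward added to the RL objective:
\begin{equation}
  r_z(s, a) = \log q_\phi(z \mid s) - \log p(z).
\end{equation}

Intuitively, skills are rewarded for visiting states from which the latent skill $z$ can be inferred, making the learned behaviors mutually distinguishable; it is the weighting and tuning of such terms that motivates the hyperparameter-sensitivity comparison above.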