In the past few years, a considerable amount of research has been dedicated to exploiting previous learning experiences and designing few-shot and meta-learning approaches, in problem domains ranging from Computer Vision to Reinforcement Learning-based control. A notable exception, where to the best of our knowledge little to no effort has been made in this direction, is Quality-Diversity (QD) optimization. QD methods have been shown to be effective tools for dealing with deceptive minima and sparse rewards in Reinforcement Learning. However, they remain costly due to their reliance on inherently sample-inefficient evolutionary processes. We show that, given examples from a task distribution, information about the paths taken by optimization in parameter space can be leveraged to build a prior population which, when used to initialize QD methods in unseen environments, allows for few-shot adaptation. Our proposed method does not require backpropagation, is simple to implement and scale, and is agnostic to the underlying models being trained. Experiments carried out in both sparse- and dense-reward settings on robotic manipulation and navigation benchmarks show that it considerably reduces the number of generations required for QD optimization in these environments.
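To make the idea concrete, the following is a minimal sketch of how optimization paths collected on training tasks could be pooled into a prior population that seeds QD on an unseen task. It is not the authors' implementation; the function names (`build_prior_population`, `initialize_qd`), the subsampling strategy, and the array shapes are illustrative assumptions.

```python
# Sketch only: pool parameter vectors visited along the optimization paths of
# training tasks and subsample them to seed the initial QD population on a new
# task, instead of starting from random parameters.
import numpy as np

def build_prior_population(optimization_paths, population_size, rng=None):
    """optimization_paths: list of arrays, each of shape (num_steps, num_params),
    one per training task. Returns an array of shape (population_size, num_params)."""
    rng = np.random.default_rng() if rng is None else rng
    pool = np.concatenate(optimization_paths, axis=0)  # all parameters visited during training
    idx = rng.choice(len(pool), size=population_size, replace=False)
    return pool[idx]

def initialize_qd(prior_population=None, population_size=64, num_params=128, rng=None):
    """Seed the initial QD population from the prior if available, else randomly."""
    rng = np.random.default_rng() if rng is None else rng
    if prior_population is not None:
        return prior_population.copy()
    return rng.normal(0.0, 1.0, size=(population_size, num_params))

# Usage sketch with placeholder data standing in for paths gathered from QD runs
# on a distribution of training tasks.
rng = np.random.default_rng(0)
paths = [rng.normal(size=(200, 128)) for _ in range(5)]
prior = build_prior_population(paths, population_size=64, rng=rng)
population = initialize_qd(prior_population=prior, rng=rng)
# `population` would then be handed to a QD loop (e.g. MAP-Elites) on the unseen task.
```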