Diversity Policy Gradient for Sample Efficient Quality-Diversity Optimization

A fascinating aspect of nature lies in its ability to produce a large and diverse collection of organisms that are all high-performing in their niche. By contrast, most AI algorithms focus on finding a single efficient solution to a given problem. Aiming for diversity in addition to performance is a convenient way to deal with the exploration-exploitation trade-off that plays a central role in learning. It also allows for increased robustness when the returned collection contains several working solutions to the considered problem, making it well-suited for real applications such as robotics. Quality-Diversity (QD) methods are evolutionary algorithms designed for this purpose. This paper proposes a novel algorithm, QD-PG, which combines the strengths of Policy Gradient algorithms and Quality-Diversity approaches to produce a collection of diverse and high-performing neural policies in continuous control environments. The main contribution of this work is the introduction of a Diversity Policy Gradient (DPG) that exploits information at the time-step level to drive policies towards more diversity in a sample-efficient manner. Specifically, QD-PG selects neural controllers from a MAP-Elites grid and uses two gradient-based mutation operators to improve both quality and diversity, resulting in stable population updates. Our results demonstrate that QD-PG generates collections of diverse solutions that solve challenging exploration and control problems while being two orders of magnitude more sample-efficient than its evolutionary competitors.
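
To make the loop described above concrete, the following is a minimal toy sketch of a MAP-Elites-style archive updated by two gradient-like mutation operators, one for quality and one for diversity. It is not the QD-PG implementation: the 2-D solution vector, the `fitness` and `descriptor` functions, the grid size `GRID`, and both step functions are hypothetical stand-ins chosen for illustration. In particular, the real method applies policy gradients to neural controllers, whereas this sketch uses an analytic gradient and a simple "move away from the archive centre" rule.

```python
import random

GRID = 5  # cells per descriptor dimension (hypothetical toy setting)


def fitness(x):
    # Toy quality score: higher is better, with a peak at (0.5, 0.5).
    return -((x[0] - 0.5) ** 2 + (x[1] - 0.5) ** 2)


def descriptor(x):
    # Toy behavior descriptor: the solution itself, clipped to [0, 1].
    return tuple(min(max(v, 0.0), 1.0) for v in x)


def cell_of(desc):
    # Map a descriptor in [0, 1]^2 to a grid cell index.
    return tuple(min(int(d * GRID), GRID - 1) for d in desc)


def quality_step(x, lr=0.1):
    # Stand-in for a quality gradient: analytic gradient ascent on fitness.
    return [v + lr * (-2.0 * (v - 0.5)) for v in x]


def diversity_step(x, archive, lr=0.1):
    # Stand-in for a diversity gradient: move away from the mean
    # descriptor of the solutions currently stored in the archive.
    descs = [descriptor(sol) for _, sol in archive.values()]
    centre = [sum(d[i] for d in descs) / len(descs) for i in range(2)]
    direction = [x[i] - centre[i] for i in range(2)]
    norm = sum(d * d for d in direction) ** 0.5
    if norm == 0.0:
        return list(x)
    return [x[i] + lr * direction[i] / norm for i in range(2)]


def try_insert(archive, x):
    # Keep a solution if its cell is empty or it beats the current elite.
    x = [min(max(v, 0.0), 1.0) for v in x]
    cell = cell_of(descriptor(x))
    f = fitness(x)
    if cell not in archive or f > archive[cell][0]:
        archive[cell] = (f, x)
        return True
    return False


def run(iterations=200, seed=0):
    rng = random.Random(seed)
    archive = {}
    for _ in range(5):  # random initial population
        try_insert(archive, [rng.random(), rng.random()])
    for it in range(iterations):
        # Select a parent uniformly among occupied cells.
        parent = archive[rng.choice(sorted(archive))][1]
        # Alternate between the two mutation operators.
        if it % 2 == 0:
            child = quality_step(parent)
        else:
            child = diversity_step(parent, archive)
        try_insert(archive, child)
    return archive


if __name__ == "__main__":
    result = run()
    print(f"occupied cells: {len(result)} / {GRID * GRID}")
```

Alternating the two operators is only one possible schedule and is assumed here for simplicity. The point of the sketch is the division of labour: the quality step improves the elite within a region, the diversity step pushes offspring towards less-visited cells, and the archive keeps the best solution found per cell.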
