Diversity Policy Gradient for Sample Efficient Quality-Diversity Optimization

Revision notes (from arXiv): added several baselines (Policy Gradient Assisted MAP-Elites, DIAYN, AGAC); changed the writing to take the point of view of the evolutionary computation community; changed style, writing, explanation, and figures.

A fascinating aspect of nature lies in its ability to produce a large and diverse collection of organisms that are all high-performing in their niche. By contrast, most AI algorithms focus on finding a single efficient solution to a given problem. Aiming for diversity in addition to performance is a convenient way to deal with the exploration-exploitation trade-off that plays a central role in learning. It also allows for increased robustness when the returned collection contains several working solutions to the considered problem, making it well-suited for real applications such as robotics. Quality-Diversity (QD) methods are evolutionary algorithms designed for this purpose. This paper proposes a novel algorithm, QDPG, which combines the strengths of Policy Gradient algorithms and Quality-Diversity approaches to produce a collection of diverse and high-performing neural policies in continuous control environments. The main contribution of this work is the introduction of a Diversity Policy Gradient (DPG) that exploits information at the time-step level to drive policies towards more diversity in a sample-efficient manner. Specifically, QDPG selects neural controllers from a MAP-Elites grid and uses two gradient-based mutation operators to improve both quality and diversity. Our results demonstrate that QDPG is significantly more sample-efficient than its evolutionary competitors.
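To make the last step of the abstract concrete, here is a minimal toy sketch of a MAP-Elites loop with two gradient-based mutation operators. It is not the authors' implementation. Everything in it is an assumption made for illustration: the 2-D parameter vector standing in for a neural controller, the analytic `fitness` and `descriptor` functions, the nearest-neighbour novelty score used as the diversity objective, the 50/50 choice between operators, and all names and hyperparameters. The paper's DPG works from time-step-level information in continuous control environments, which this sketch does not model.

```python
import math
import random

GRID = 10                 # cells per descriptor dimension (assumed)
LOW, HIGH = -1.0, 1.0     # descriptor bounds (assumed)

def clip(x):
    # Keep parameters inside the descriptor bounds.
    return [min(max(v, LOW), HIGH) for v in x]

def fitness(x):
    # Toy quality score: higher is better, maximum at the origin.
    return -sum(v * v for v in x)

def descriptor(x):
    # Toy behavior descriptor: the (clipped) parameters themselves.
    return tuple(clip(x))

def cell_of(desc):
    # Map a descriptor to a discrete grid cell.
    return tuple(min(int((d - LOW) / (HIGH - LOW) * GRID), GRID - 1) for d in desc)

def quality_step(x, lr=0.1):
    # Gradient ascent on fitness: the gradient of -|x|^2 is -2x.
    return [v + lr * (-2.0 * v) for v in x]

def diversity_step(x, archive, lr=0.1, k=3):
    # Gradient ascent on a novelty score, taken here to be the mean distance
    # to the k nearest *other* descriptors stored in the archive.
    d = descriptor(x)
    near = sorted((math.dist(d, e["desc"]), e["desc"])
                  for e in archive.values() if e["desc"] != d)[:k]
    if not near:
        return list(x)
    grad = [0.0] * len(x)
    for dist, nd in near:
        for i in range(len(x)):
            grad[i] += (d[i] - nd[i]) / dist / len(near)
    return [v + lr * g for v, g in zip(x, grad)]

def try_insert(archive, x):
    # MAP-Elites rule: keep x if its cell is empty or it beats the current elite.
    cell, fit = cell_of(descriptor(x)), fitness(x)
    if cell not in archive or fit > archive[cell]["fit"]:
        archive[cell] = {"x": list(x), "desc": descriptor(x), "fit": fit}
        return True
    return False

def run(iterations=2000, seed=0):
    rng = random.Random(seed)
    archive = {}
    for _ in range(20):  # random initial population
        try_insert(archive, [rng.uniform(LOW, HIGH) for _ in range(2)])
    for _ in range(iterations):
        parent = rng.choice(list(archive.values()))["x"]
        # Pick one of the two operators with equal probability (assumed split).
        if rng.random() < 0.5:
            child = quality_step(parent)
        else:
            child = diversity_step(parent, archive)
        try_insert(archive, clip(child))
    return archive

if __name__ == "__main__":
    result = run()
    print("filled cells:", len(result))
    print("best fitness:", max(e["fit"] for e in result.values()))
```

The quality operator pushes a selected solution towards higher fitness, while the diversity operator pushes it away from descriptors already in the grid. The grid itself only keeps the best solution seen per cell, which is what lets the two pressures coexist.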
