We propose a novel reinforcement learning algorithm, QD-RL, that incorporates the strengths of off-policy RL algorithms into Quality-Diversity (QD) approaches. Quality-Diversity methods contribute structural biases by decoupling the search for diversity from the search for high return, resulting in efficient management of the exploration-exploitation trade-off. However, these approaches generally suffer from sample inefficiency as they rely on evolutionary techniques. QD-RL removes this limitation by building on off-policy RL algorithms. More precisely, we train a population of off-policy deep RL agents to simultaneously maximize diversity within the population and the return of each agent. QD-RL selects agents from the diversity-return Pareto front, resulting in stable and efficient population updates. Our experiments on the Ant-Maze environment show that QD-RL can solve challenging exploration and control problems with deceptive rewards while being more than 15 times more sample efficient than its evolutionary counterparts.
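To make the selection step concrete, the following is a minimal, self-contained sketch (not the authors' implementation) of how agents could be kept when they lie on the diversity-return Pareto front, as described above. The `Agent` structure and the `ret`/`diversity` score names are hypothetical and stand in for whatever return estimate and population-level diversity measure the method actually uses.

```python
# Sketch only: Pareto-front selection over (return, diversity) scores.
# Names and scores are illustrative, not taken from the QD-RL codebase.
from dataclasses import dataclass
from typing import List

@dataclass
class Agent:
    policy_id: int
    ret: float        # hypothetical episodic return of the agent
    diversity: float  # hypothetical diversity score w.r.t. the population

def pareto_front(population: List[Agent]) -> List[Agent]:
    """Keep agents that no other agent dominates in both return and diversity."""
    front = []
    for a in population:
        dominated = any(
            (b.ret >= a.ret and b.diversity >= a.diversity)
            and (b.ret > a.ret or b.diversity > a.diversity)
            for b in population
        )
        if not dominated:
            front.append(a)
    return front

# Example: only non-dominated agents would seed the next population update.
pop = [Agent(0, 120.0, 0.3), Agent(1, 90.0, 0.8), Agent(2, 80.0, 0.2)]
print([a.policy_id for a in pareto_front(pop)])  # -> [0, 1]
```

Keeping the whole non-dominated set, rather than a single best agent, is what lets the population pursue high return and high diversity at the same time without collapsing onto one behavior.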