Consider a walking agent that must adapt to damage. To approach this task, we can train a collection of policies and have the agent select a suitable policy when damaged. Training this collection may be viewed as a quality diversity (QD) optimization problem, where we search for solutions (policies) which maximize an objective (walking forward) while spanning a set of measures (measurable characteristics). Recent work shows that differentiable quality diversity (DQD) algorithms greatly accelerate QD optimization when exact gradients are available for the objective and measures. However, such gradients are typically unavailable in RL settings due to non-differentiable environments. To apply DQD in RL settings, we propose to approximate objective and measure gradients with evolution strategies and actor-critic methods. We develop two variants of the DQD algorithm CMA-MEGA, each with different gradient approximations, and evaluate them on four simulated walking tasks. One variant achieves comparable performance (QD score) with the state-of-the-art PGA-MAP-Elites in two tasks. The other variant performs comparably in all tasks but is less efficient than PGA-MAP-Elites in two tasks. These results provide insight into the limitations of CMA-MEGA in domains that require rigorous optimization of the objective and where exact gradients are unavailable.
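To make the gradient-approximation idea concrete, the following is a minimal sketch (not the paper's implementation) of estimating the gradient of a black-box objective f with respect to policy parameters theta using a vanilla evolution strategy with antithetic sampling; the function name and hyperparameters are illustrative assumptions, but this is the general mechanism by which ES can supply approximate objective gradients when the environment is non-differentiable.

```python
import numpy as np

def es_gradient(f, theta, sigma=0.02, n_samples=50, rng=None):
    """Antithetic ES estimate of grad_theta E[f(theta + sigma * eps)]."""
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal((n_samples, theta.size))
    # Evaluate the objective at mirrored perturbations of theta.
    f_pos = np.array([f(theta + sigma * e) for e in eps])
    f_neg = np.array([f(theta - sigma * e) for e in eps])
    # Score-function estimator: weight each perturbation by its return difference.
    return ((f_pos - f_neg)[:, None] * eps).sum(axis=0) / (2 * n_samples * sigma)

# Toy usage: in the RL setting f would be an episode return (or a measure);
# here a simple quadratic stands in so the example runs without a simulator.
if __name__ == "__main__":
    f = lambda w: -np.sum((w - 1.0) ** 2)
    theta = np.zeros(5)
    for _ in range(200):
        theta += 0.1 * es_gradient(f, theta)
    print(theta)  # approaches the optimum at all-ones
```

The same estimator can be applied to each measure function, which is how a DQD algorithm such as CMA-MEGA can be driven entirely by sampled rollouts rather than exact gradients.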