Consider the problem of training robustly capable agents. One approach is to generate a diverse collection of agent policies. Training can then be viewed as a quality diversity (QD) optimization problem, where we search for a collection of performant policies that are diverse with respect to quantified behavior. Recent work shows that differentiable quality diversity (DQD) algorithms greatly accelerate QD optimization when exact gradients are available. However, agent policies typically assume that the environment is not differentiable. To apply DQD algorithms to training agent policies, we must approximate gradients for performance and behavior. We propose two variants of the current state-of-the-art DQD algorithm that compute gradients via approximation methods common in reinforcement learning (RL). We evaluate our approach on four simulated locomotion tasks. One variant achieves results comparable to the current state-of-the-art in combining QD and RL, while the other performs comparably in two of the four tasks. These results provide insight into the limitations of current DQD algorithms in domains where gradients must be approximated. Source code is available at https://github.com/icaros-usc/dqd-rl.
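As context for "approximation methods common in reinforcement learning," one widely used black-box gradient estimator is the evolution-strategies-style estimator built from Gaussian perturbations of the policy parameters. The sketch below is a minimal illustration of that idea only, not the released implementation; the function name `es_gradient`, its parameters, and the assumption that the policy parameters are a flat vector are hypothetical.

```python
import numpy as np

def es_gradient(f, theta, sigma=0.02, n_samples=100, rng=None):
    """Estimate the gradient of a black-box function f (e.g., an episode
    return or a behavior measure) at the flat parameter vector theta,
    using an evolution-strategies-style estimator with antithetic sampling.
    """
    rng = np.random.default_rng() if rng is None else rng
    eps = rng.standard_normal((n_samples, theta.size))
    grad = np.zeros_like(theta)
    for e in eps:
        # Antithetic pair (theta + sigma*e, theta - sigma*e) reduces variance.
        f_plus = f(theta + sigma * e)
        f_minus = f(theta - sigma * e)
        grad += (f_plus - f_minus) * e
    return grad / (2 * n_samples * sigma)
```

Such an estimator can stand in for the exact objective and measure gradients that DQD algorithms normally require, at the cost of extra environment rollouts per gradient estimate.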