Most recent reinforcement learning algorithms focus on finding a single optimal solution. However, in many practical applications, it is important to train agents that exhibit diverse strategies. In this paper, we propose Diversity-Guided Policy Optimization (DGPO), an on-policy framework for discovering multiple strategies for the same task. Our algorithm uses diversity objectives to guide a latent-code-conditioned policy toward learning a set of diverse strategies within a single training procedure. Specifically, we formalize the algorithm as the combination of a diversity-constrained optimization problem and an extrinsic-reward-constrained optimization problem. We then cast the constrained optimization as a probabilistic inference task and use policy iteration to maximize the derived lower bound. Experimental results show that our method efficiently finds diverse strategies across a wide variety of reinforcement learning tasks. Compared with baseline methods, DGPO achieves higher diversity scores while retaining similar sample complexity and performance.
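To make the two constrained problems mentioned above concrete, the following LaTeX sketch gives one plausible formalization; the symbols J_R (extrinsic return), J_D (diversity measure), the latent code z, and the thresholds \delta_R, \delta_D are illustrative assumptions and not necessarily the paper's exact notation.

% Hedged sketch of the two constrained problems described in the abstract.
% All symbols (J_R, J_D, z, \delta_R, \delta_D) are assumed notation.
\begin{align*}
  % Extrinsic-reward-constrained problem: maximize diversity of the
  % latent-conditioned policy while keeping the return of every latent
  % code z above a threshold \delta_R.
  \max_{\theta} \; & J_D(\pi_\theta)
  && \text{s.t. } J_R\big(\pi_\theta(\cdot \mid z)\big) \ge \delta_R \quad \forall z, \\
  % Diversity-constrained problem: maximize extrinsic return while
  % keeping the diversity measure above a threshold \delta_D.
  \max_{\theta} \; & J_R(\pi_\theta)
  && \text{s.t. } J_D(\pi_\theta) \ge \delta_D .
\end{align*}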