Finding different solutions to the same problem is a key aspect of intelligence associated with creativity and adaptation to novel situations. In reinforcement learning, a set of diverse policies can be useful for exploration, transfer, hierarchy, and robustness. We propose DOMiNO, a method for Diversity Optimization Maintaining Near Optimality. We formalize the problem as a Constrained Markov Decision Process where the objective is to find diverse policies, measured by the distance between the state occupancies of the policies in the set, while remaining near-optimal with respect to the extrinsic reward. We demonstrate that the method can discover diverse and meaningful behaviors in various domains, such as different locomotion patterns in the DeepMind Control Suite. We perform extensive analysis of our approach, compare it with other multi-objective baselines, demonstrate that we can control both the quality and the diversity of the set via interpretable hyperparameters, and show that the discovered set is robust to perturbations.
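The constrained formulation described above can be sketched as follows. The notation here is assumed for illustration and may differ from the paper's exact symbols: \(J(\pi)\) denotes the extrinsic return of policy \(\pi\), \(\pi^*\) an optimal policy, \(d(\pi_i)\) the state occupancy of the \(i\)-th policy in the set, \(\mathrm{Div}\) some diversity measure over pairwise occupancy distances, and \(\alpha \in [0,1]\) an optimality ratio:

```latex
\max_{\pi_1,\dots,\pi_n} \; \mathrm{Div}\big(d(\pi_1),\dots,d(\pi_n)\big)
\quad \text{s.t.} \quad J(\pi_i) \,\ge\, \alpha \, J(\pi^*) \quad \forall i \in \{1,\dots,n\}.
```

Under this reading, \(\alpha\) is one of the interpretable hyperparameters mentioned in the abstract: raising it tightens the near-optimality constraint (higher quality, less room for diversity), while lowering it admits a more diverse but lower-return set.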