Finding different solutions to the same problem is a key aspect of intelligence associated with creativity and adaptation to novel situations. In reinforcement learning, a set of diverse policies can be useful for exploration, transfer, hierarchy, and robustness. We propose Diverse Successive Policies, a method for discovering policies that are diverse in the space of Successor Features, while ensuring that they are near-optimal. We formalize the problem as a Constrained Markov Decision Process (CMDP) where the goal is to find policies that maximize diversity, characterized by an intrinsic diversity reward, while remaining near-optimal with respect to the extrinsic reward of the MDP. We also analyze how recently proposed robustness and discrimination rewards perform and find that they are sensitive to the initialization of the procedure and may converge to sub-optimal solutions. To alleviate this, we propose new explicit diversity rewards that aim to minimize the correlation between the Successor Features of the policies in the set. We compare the different diversity mechanisms in the DeepMind Control Suite and find that the type of explicit diversity we propose is important for discovering distinct behaviors, such as different locomotion patterns.
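To make the correlation-based diversity reward concrete, the following is a minimal Python/NumPy sketch. It assumes successor features are estimated by Monte-Carlo rollouts over per-step feature vectors phi(s, a), and it uses negative cosine similarity as the correlation measure; the function names and this particular choice of correlation penalty are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def successor_features(features, gamma=0.99):
    """Monte-Carlo estimate of a policy's successor features from one
    trajectory, given a (T, d) array of per-step feature vectors phi(s_t, a_t)."""
    psi = np.zeros(features.shape[1])
    discount = 1.0
    for phi in features:
        psi += discount * phi
        discount *= gamma
    return psi

def correlation_diversity_reward(phi, prior_sfs, eps=1e-8):
    """Intrinsic reward for one step: negative cosine similarity between the
    step features phi(s, a) and the mean successor features of the policies
    already in the set, pushing the new policy's successor features away
    from the existing ones."""
    if not prior_sfs:
        return 0.0
    mean_sf = np.mean(prior_sfs, axis=0)
    cos = phi @ mean_sf / (np.linalg.norm(phi) * np.linalg.norm(mean_sf) + eps)
    return -cos
```

In the full method this intrinsic reward would be maximized subject to the CMDP constraint that the extrinsic return stays within a chosen fraction of the optimal value, so that the discovered policies remain near-optimal while differing in their successor features.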