We introduce Contrastive Intrinsic Control (CIC), an algorithm for unsupervised skill discovery that maximizes the mutual information between skills and state transitions. In contrast to most prior approaches, CIC uses a decomposition of the mutual information that explicitly incentivizes diverse behaviors by maximizing state entropy. We derive a novel lower bound estimate for the mutual information which combines a particle estimator for state entropy to generate diverse behaviors and contrastive learning to distill these behaviors into distinct skills. We evaluate our algorithm on the Unsupervised Reinforcement Learning Benchmark, which consists of a long reward-free pre-training phase followed by a short adaptation phase to downstream tasks with extrinsic rewards. We find that CIC substantially improves over prior unsupervised skill discovery methods and outperforms the next leading overall exploration algorithm in terms of downstream task performance.
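The particle estimator mentioned above can be illustrated with a minimal sketch: a k-nearest-neighbor (kNN) estimate that scores a batch of state embeddings by the log distance to each particle's k-th neighbor, so spread-out states receive a higher entropy estimate than clustered ones. The function name, array shapes, and constants here are illustrative assumptions, not the paper's actual API.

```python
import numpy as np

def particle_entropy(states: np.ndarray, k: int = 3) -> float:
    """kNN particle-based estimate of state entropy (up to constants).

    `states` is an (N, D) array of state embeddings. Each particle is
    scored by the log distance to its k-th nearest neighbor; diverse
    batches yield larger average distances and thus higher entropy.
    Illustrative sketch only, not the paper's implementation.
    """
    # Pairwise Euclidean distances between all state particles.
    diffs = states[:, None, :] - states[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    # Distance to the k-th nearest neighbor for each particle
    # (sorted index 0 is the zero self-distance, so index k is valid).
    knn_dists = np.sort(dists, axis=1)[:, k]
    # Mean log kNN distance approximates entropy up to constants.
    return float(np.mean(np.log(knn_dists + 1e-8)))

# Spread-out states should score higher entropy than clustered ones.
rng = np.random.default_rng(0)
spread = rng.uniform(-1.0, 1.0, size=(256, 4))
clustered = rng.normal(0.0, 0.01, size=(256, 4))
```

In CIC-style training, an estimate of this form serves as the intrinsic reward that drives the agent toward diverse state visitation, while the contrastive objective associates those states with skill vectors.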