We introduce Contrastive Intrinsic Control (CIC), an algorithm for unsupervised skill discovery that maximizes the mutual information between state-transitions and latent skill vectors. CIC uses contrastive learning between state-transitions and skills to learn behavior embeddings, and maximizes the entropy of these embeddings as an intrinsic reward to encourage behavioral diversity. We evaluate our algorithm on the Unsupervised Reinforcement Learning Benchmark, which consists of a long reward-free pre-training phase followed by a short adaptation phase to downstream tasks with extrinsic rewards. CIC substantially improves over prior methods in terms of adaptation efficiency, outperforming prior unsupervised skill discovery methods by 1.79x and the next-best overall exploration algorithm by 1.18x.
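The two ingredients named above can be sketched in a few lines: a CPC-style noise-contrastive objective that lower-bounds the mutual information between state-transition embeddings and skill embeddings, and a particle-based entropy estimate over the embeddings that serves as an intrinsic reward. The function names, the k-nearest-neighbor entropy form, and the NumPy framing below are illustrative assumptions, not the paper's exact implementation:

```python
import numpy as np

def info_nce_loss(transition_emb, skill_emb, temperature=0.5):
    """CPC-style contrastive loss: row i of each matrix is a positive
    (transition, skill) pair; all other rows in the batch are negatives.
    Minimizing this maximizes a lower bound on I(transition; skill)."""
    # L2-normalize so similarities are cosine similarities.
    t = transition_emb / np.linalg.norm(transition_emb, axis=1, keepdims=True)
    z = skill_emb / np.linalg.norm(skill_emb, axis=1, keepdims=True)
    logits = t @ z.T / temperature  # (B, B) similarity matrix
    # Cross-entropy with the diagonal as the positive class.
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))

def knn_entropy_reward(embeddings, k=3):
    """Particle-based entropy estimate used as an intrinsic reward:
    each embedding is rewarded by the (log) distance to its k-th
    nearest neighbor in the batch, so spread-out behaviors score higher."""
    dists = np.linalg.norm(embeddings[:, None] - embeddings[None, :], axis=-1)
    np.fill_diagonal(dists, np.inf)  # exclude self-distance
    kth = np.sort(dists, axis=1)[:, k - 1]
    return np.log(1.0 + kth)
```

In a full agent these would be combined: the contrastive loss trains the transition and skill encoders, while the entropy term is added to the reward used by the underlying RL optimizer during the reward-free pre-training phase.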