This paper presents a novel state representation for reward-free Markov decision processes. The idea is to learn, in a self-supervised manner, an embedding space where the distance between a pair of embedded states corresponds to the minimum number of actions needed to transition between them. Unlike previous methods, our approach requires no domain knowledge, learning instead from offline, unlabeled data. We show how this representation can be leveraged to learn goal-conditioned policies, providing a notion of similarity between states and goals and a useful heuristic distance to guide planning and reinforcement learning algorithms. Finally, we empirically validate our method on classic control domains and multi-goal environments, demonstrating that it can successfully learn representations in large and/or continuous domains.
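Although the abstract states the objective only informally, a minimal self-supervised sketch may help make it concrete. The example below is an assumption, not the paper's actual training procedure: it regresses the Euclidean distance between embedded states onto the temporal gap k between states sampled from offline trajectories, which upper-bounds the minimum number of actions. The names `StateEncoder`, `sample_pair`, and `train` are hypothetical.

```python
# Hypothetical sketch (not the paper's actual objective): train an encoder phi so that
# the distance between embedded states approximates the number of actions separating
# them along offline, unlabeled trajectories.
import torch
import torch.nn as nn

class StateEncoder(nn.Module):
    def __init__(self, state_dim: int, embed_dim: int = 32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, 128), nn.ReLU(),
            nn.Linear(128, embed_dim),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s)

def sample_pair(trajectories, max_gap: int = 10):
    """Sample (s_t, s_{t+k}, k) from a randomly chosen offline trajectory."""
    traj = trajectories[torch.randint(len(trajectories), (1,)).item()]
    t = torch.randint(0, len(traj) - 1, (1,)).item()
    k = torch.randint(1, min(max_gap, len(traj) - t - 1) + 1, (1,)).item()
    return traj[t], traj[t + k], float(k)

def train(trajectories, state_dim, steps=1000, batch_size=64, lr=1e-3):
    encoder = StateEncoder(state_dim)
    opt = torch.optim.Adam(encoder.parameters(), lr=lr)
    for _ in range(steps):
        batch = [sample_pair(trajectories) for _ in range(batch_size)]
        s_a = torch.stack([b[0] for b in batch])
        s_b = torch.stack([b[1] for b in batch])
        k = torch.tensor([b[2] for b in batch])
        # The embedding distance is regressed onto the temporal gap k,
        # a proxy (upper bound) for the minimum action distance.
        dist = torch.norm(encoder(s_a) - encoder(s_b), dim=-1)
        loss = nn.functional.mse_loss(dist, k)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return encoder
```

Once trained, such an encoder could supply the heuristic mentioned in the abstract: the distance between the embeddings of the current state and a goal state serves as an estimate of how many actions remain, usable as a similarity measure or a planning heuristic.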