We present a new behavioural distance over the state space of a Markov decision process, and demonstrate the use of this distance as an effective means of shaping the learnt representations of deep reinforcement learning agents. While existing notions of state similarity are typically difficult to learn at scale due to high computational cost and lack of sample-based algorithms, our newly-proposed distance addresses both of these issues. In addition to providing detailed theoretical analysis, we provide empirical evidence that learning this distance alongside the value function yields structured and informative representations, including strong results on the Arcade Learning Environment benchmark.