The fundamental assumption of reinforcement learning in Markov decision processes (MDPs) is that the relevant decision process is, in fact, Markov. However, when MDPs have rich observations, agents typically learn by way of an abstract state representation, and such representations are not guaranteed to preserve the Markov property. We introduce a novel set of conditions and prove that they are sufficient for learning a Markov abstract state representation. We then describe a practical training procedure that combines inverse model estimation and temporal contrastive learning to learn an abstraction that approximately satisfies these conditions. Our novel training objective is compatible with both online and offline training: it does not require a reward signal, but agents can capitalize on reward information when available. We empirically evaluate our approach on a visual gridworld domain and a set of continuous control benchmarks. Our approach learns representations that capture the underlying structure of the domain and lead to improved sample efficiency over state-of-the-art deep reinforcement learning with visual features -- often matching or exceeding the performance achieved with hand-designed compact state information.
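To make the training procedure concrete, below is a minimal sketch of how an inverse-model loss and a temporal contrastive loss might be combined into a single objective for an abstraction encoder. This is an illustration under assumed choices (PyTorch, a dense encoder, discrete actions, and the hypothetical names `MarkovAbstraction`, `inverse_head`, and `contrastive_head`), not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MarkovAbstraction(nn.Module):
    """Encoder phi plus two auxiliary heads: an inverse model and a
    temporal contrastive discriminator (illustrative sketch only)."""

    def __init__(self, obs_dim, n_actions, latent_dim=64):
        super().__init__()
        # phi: observation -> abstract state
        self.encoder = nn.Sequential(
            nn.Linear(obs_dim, 128), nn.ReLU(),
            nn.Linear(128, latent_dim),
        )
        # Inverse model: predict a_t from (z_t, z_{t+1})
        self.inverse_head = nn.Sequential(
            nn.Linear(2 * latent_dim, 128), nn.ReLU(),
            nn.Linear(128, n_actions),
        )
        # Contrastive head: score whether a pair of abstract states
        # comes from consecutive time steps
        self.contrastive_head = nn.Sequential(
            nn.Linear(2 * latent_dim, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def loss(self, obs, next_obs, actions):
        z, z_next = self.encoder(obs), self.encoder(next_obs)

        # Inverse-model loss: the abstraction must retain enough
        # information to identify which action was taken.
        logits = self.inverse_head(torch.cat([z, z_next], dim=-1))
        inverse_loss = F.cross_entropy(logits, actions)

        # Temporal contrastive loss: distinguish true next states from
        # "fake" next states shuffled within the batch.
        z_fake = z_next[torch.randperm(z_next.size(0))]
        pos = self.contrastive_head(torch.cat([z, z_next], dim=-1))
        neg = self.contrastive_head(torch.cat([z, z_fake], dim=-1))
        scores = torch.cat([pos, neg], dim=0).squeeze(-1)
        labels = torch.cat(
            [torch.ones_like(pos), torch.zeros_like(neg)], dim=0
        ).squeeze(-1)
        contrastive_loss = F.binary_cross_entropy_with_logits(scores, labels)

        return inverse_loss + contrastive_loss

# Example usage on a batch of (obs, action, next_obs) transitions.
model = MarkovAbstraction(obs_dim=48, n_actions=4)
obs, next_obs = torch.randn(32, 48), torch.randn(32, 48)
actions = torch.randint(0, 4, (32,))
model.loss(obs, next_obs, actions).backward()
```

Note that this objective uses only transitions, not rewards, which is consistent with the abstract's claim that the training objective is reward-free but compatible with both online and offline training.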