Intelligent agents can cope with sensory-rich environments by learning task-agnostic state abstractions. In this paper, we propose an algorithm to approximate causal states, which are the coarsest partition of the joint history of actions and observations in partially observable Markov decision processes (POMDPs). Our method learns approximate causal state representations from RNNs trained to predict subsequent observations given the history. We demonstrate that these learned state representations are useful for efficiently learning policies in reinforcement learning problems with rich observation spaces. We connect causal states with causal feature sets from the causal inference literature, and provide theoretical guarantees on the optimality of the continuous version of this causal state representation under Lipschitz assumptions by proving its equivalence to bisimulation, a relation between behaviorally equivalent systems. This yields lower bounds on the optimal value function of the learned representation, which are tight under certain assumptions. Finally, we empirically evaluate causal state representations on multiple partially observable tasks and compare with prior methods.
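As a rough illustration of the predictive setup described above (not the paper's exact architecture), the sketch below shows an RNN trained to predict the next observation from the action-observation history, with its hidden state reused as an approximate causal state representation; all layer sizes and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class CausalStateRNN(nn.Module):
    """Sketch: hidden state of a next-observation predictor as an approximate causal state."""
    def __init__(self, obs_dim, act_dim, state_dim=64):
        super().__init__()
        # Summarize the (action, observation) history with a GRU.
        self.rnn = nn.GRU(obs_dim + act_dim, state_dim, batch_first=True)
        # Predict the next observation from the current hidden state.
        self.decoder = nn.Linear(state_dim, obs_dim)

    def forward(self, obs_seq, act_seq):
        # obs_seq: (batch, T, obs_dim), act_seq: (batch, T, act_dim)
        inputs = torch.cat([obs_seq, act_seq], dim=-1)
        states, _ = self.rnn(inputs)           # approximate causal states, one per step
        next_obs_pred = self.decoder(states)   # predicted o_{t+1} at each step
        return states, next_obs_pred

# Training sketch: minimize next-observation prediction error; the learned
# hidden states can then serve as inputs to a downstream RL policy.
model = CausalStateRNN(obs_dim=16, act_dim=4)
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
obs = torch.randn(8, 20, 16)    # dummy batch of observation sequences
acts = torch.randn(8, 20, 4)    # dummy batch of action sequences
states, pred = model(obs, acts)
loss = nn.functional.mse_loss(pred[:, :-1], obs[:, 1:])  # align predictions with o_{t+1}
loss.backward()
optimizer.step()
```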