To accumulate knowledge and improve its behaviour policy, a reinforcement learning agent can learn `off-policy' about policies that differ from the policy used to generate its experience. This is important for learning counterfactuals, or when the experience was generated outside of the agent's control. However, off-policy learning is non-trivial, and standard reinforcement-learning algorithms can be unstable and divergent. In this paper we discuss a novel family of off-policy prediction algorithms that are convergent by construction. The idea is to first learn on-policy about the data-generating behaviour, and then bootstrap an off-policy value estimate on this on-policy estimate, thereby constructing a value estimate that is partially off-policy. This process can be repeated to build a chain of value functions, each time bootstrapping a new estimate on the previous estimate in the chain. Each step in the chain is stable, and hence the complete algorithm is guaranteed to be stable. Under mild conditions the result comes arbitrarily close to the off-policy TD solution as the length of the chain increases. Hence it can compute the solution even in cases where off-policy TD diverges. We prove that the proposed scheme is convergent and corresponds to an iterative decomposition of the inverse key matrix. Furthermore, it can be interpreted as estimating a novel objective -- which we call a `k-step expedition' -- of following the target policy for finitely many steps before continuing indefinitely with the behaviour policy. Empirically, we evaluate the idea on challenging MDPs such as Baird's counterexample and observe favourable results.
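As a sketch of the chaining idea, in standard TD notation that we assume here (target policy $\pi$, behaviour policy $\mu$, discount $\gamma$; not the paper's own formalisation), each link of the chain bootstraps one step of the target policy on the previous link's estimate:
\begin{align*}
v_0(s) &\approx v_\mu(s), \\
v_{k+1}(s) &\approx \mathbb{E}_{\pi}\big[\, R_{t+1} + \gamma\, v_k(S_{t+1}) \mid S_t = s \,\big].
\end{align*}
Under this reading, $v_k$ estimates the value of a `k-step expedition': follow $\pi$ for $k$ steps and then continue with $\mu$ indefinitely. Each link is a stable one-step prediction problem, and increasing $k$ moves the estimate towards the off-policy TD solution.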