Offline policy evaluation is a fundamental statistical problem in reinforcement learning that involves estimating the value function of some decision-making policy given data collected by a potentially different policy. In order to tackle problems with complex, high-dimensional observations, there has been significant interest from theoreticians and practitioners alike in understanding the possibility of function approximation in reinforcement learning. Despite significant study, a sharp characterization of when we might expect offline policy evaluation to be tractable, even in the simplest setting of linear function approximation, has so far remained elusive, with a surprising number of strong negative results recently appearing in the literature. In this work, we identify simple control-theoretic and linear-algebraic conditions that are necessary and sufficient for classical methods, in particular Fitted Q-iteration (FQI) and least squares temporal difference learning (LSTD), to succeed at offline policy evaluation. Using this characterization, we establish a precise hierarchy of regimes under which these estimators succeed. We prove that LSTD works under strictly weaker conditions than FQI. Furthermore, we establish that if a problem is not solvable via LSTD, then it cannot be solved by a broad class of linear estimators, even in the limit of infinite data. Taken together, our results provide a complete picture of the behavior of linear estimators for offline policy evaluation, unify previously disparate analyses of canonical algorithms, and provide significantly sharper notions of the underlying statistical complexity of offline policy evaluation.
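For readers unfamiliar with the two classical estimators named above, the following is a minimal illustrative sketch (not the paper's implementation) of LSTD and FQI for offline policy evaluation with linear function approximation. All names (phi, pi, gamma, transitions) are assumptions introduced here for illustration, and the sketch assumes a deterministic target policy and an invertible LSTD matrix.

```python
# Illustrative sketch only: LSTD and FQI with linear features.
# Assumed inputs (not fixed by the paper):
#   transitions: list of (s, a, r, s_next) tuples collected offline
#   phi(s, a):   feature map returning a length-d numpy vector
#   pi(s):       the target (evaluated) policy, assumed deterministic
#   gamma:       discount factor in [0, 1)
import numpy as np

def lstd(transitions, phi, pi, gamma):
    """LSTD: solve the linear system A theta = b in one shot, where
    A = sum_i phi_i (phi_i - gamma * phi'_i)^T and b = sum_i r_i phi_i."""
    d = phi(*transitions[0][:2]).shape[0]
    A, b = np.zeros((d, d)), np.zeros(d)
    for s, a, r, s_next in transitions:
        f = phi(s, a)
        f_next = phi(s_next, pi(s_next))   # next-state feature under the target policy
        A += np.outer(f, f - gamma * f_next)
        b += r * f
    return np.linalg.solve(A, b)           # assumes A is invertible

def fqi(transitions, phi, pi, gamma, num_iters=100):
    """FQI: repeatedly regress the bootstrapped targets
    r + gamma * phi'^T theta_k onto the features."""
    F = np.array([phi(s, a) for s, a, _, _ in transitions])
    F_next = np.array([phi(s_next, pi(s_next)) for _, _, _, s_next in transitions])
    rewards = np.array([r for _, _, r, _ in transitions])
    theta = np.zeros(F.shape[1])
    gram_inv = np.linalg.pinv(F.T @ F)      # fixed design matrix, inverted once
    for _ in range(num_iters):
        targets = rewards + gamma * F_next @ theta
        theta = gram_inv @ F.T @ targets
    return theta
```

The structural difference visible here mirrors the hierarchy established in the paper: LSTD solves a single linear system, whereas FQI iterates a regression whose convergence requires stronger conditions on the underlying problem.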