The problem of Offline Policy Evaluation (OPE) in Reinforcement Learning (RL) is a critical step towards applying RL in real-life applications. Existing work on OPE mostly focuses on evaluating a fixed target policy $\pi$, which does not provide useful bounds for offline policy learning, since the learned policy $\pi$ is then data-dependent. We address this problem by simultaneously evaluating all policies in a policy class $\Pi$ -- uniform convergence in OPE -- and obtain nearly optimal error bounds for a number of global / local policy classes. Our results imply that model-based planning achieves an optimal episode complexity of $\widetilde{O}(H^3/d_m\epsilon^2)$ in identifying an $\epsilon$-optimal policy under the time-inhomogeneous episodic MDP model, where $H$ is the planning horizon and $d_m$ is a quantity that reflects the exploration of the logging policy $\mu$. To the best of our knowledge, this is the first time the optimal rate has been shown to be achievable in the offline RL setting, and this is the first paper to systematically investigate uniform convergence in OPE.
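To make the uniform-convergence requirement concrete, the following display is a minimal sketch of the guarantee as we read it; the symbols $\widehat{v}^{\pi}$ (the value estimate built from the offline data) and $v^{\pi}$ (the true value of policy $\pi$) are standard OPE notation assumed here rather than introduced in this section:
$$
\sup_{\pi \in \Pi} \big| \widehat{v}^{\pi} - v^{\pi} \big| \le \epsilon \quad \text{with high probability.}
$$
Under such a guarantee, any policy $\widehat{\pi}$ maximizing $\widehat{v}^{\pi}$ over $\Pi$ satisfies $v^{\widehat{\pi}} \ge \max_{\pi \in \Pi} v^{\pi} - 2\epsilon$, which is why a uniform (rather than pointwise) bound remains valid even though the selected policy is data-dependent.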