There have been many recent advances in provably efficient Reinforcement Learning (RL) for problems with rich observation spaces. However, all of these works rely on a strong realizability assumption about the optimal value function of the true MDP. Such realizability assumptions are often too strong to hold in practice. In this work, we consider the more realistic setting of agnostic RL with rich observation spaces and a fixed class of policies $\Pi$ that may not contain any near-optimal policy. We provide an algorithm for this setting whose error is bounded in terms of the rank $d$ of the underlying MDP. Specifically, our algorithm enjoys a sample complexity bound of $\widetilde{O}\left((H^{4d} K^{3d} \log |\Pi|)/\epsilon^2\right)$, where $H$ is the episode length, $K$ is the number of actions, and $\epsilon>0$ is the desired sub-optimality. We also provide a nearly matching lower bound for this agnostic setting, showing that the exponential dependence on rank is unavoidable without further assumptions.
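To make the scaling of the stated bound concrete, the following is a minimal sketch (not from the paper) that evaluates the dominant term of $\widetilde{O}\left((H^{4d} K^{3d} \log |\Pi|)/\epsilon^2\right)$ for hypothetical parameter values, with constants and polylogarithmic factors suppressed.

```python
import math

def sample_complexity_bound(H: int, K: int, d: int, num_policies: int, eps: float) -> float:
    """Illustrative evaluation of the dominant term
    (H^{4d} * K^{3d} * log|Pi|) / eps^2 of the stated bound,
    ignoring constants and polylog factors hidden by the O-tilde."""
    return (H ** (4 * d)) * (K ** (3 * d)) * math.log(num_policies) / eps ** 2

# Hypothetical example: a rank-2 MDP with horizon 10, 4 actions,
# a policy class of size 10^6, and target sub-optimality eps = 0.1.
print(f"{sample_complexity_bound(H=10, K=4, d=2, num_policies=10**6, eps=0.1):.3e}")
```

The exponential dependence on the rank $d$ (through the exponents $4d$ and $3d$) is what the accompanying lower bound shows to be unavoidable without further assumptions.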