Several practical applications of reinforcement learning involve an agent learning from past data without the possibility of further exploration. Often these applications require us to 1) identify a near-optimal policy or to 2) estimate the value of a target policy. For both tasks we derive \emph{exponential} information-theoretic lower bounds in discounted infinite-horizon MDPs with a linear function representation for the action-value function, even if 1) \emph{realizability} holds, 2) the batch algorithm observes the exact reward and transition \emph{functions}, and 3) the batch algorithm is given the \emph{best} a priori data distribution for the problem class. Furthermore, if the dataset does not come from policy rollouts, then the lower bounds hold even if the action-value function of \emph{every} policy admits a linear representation. If the objective is to find a near-optimal policy, we discover that these hard instances are easily solved by an \emph{online} algorithm, showing that there exist RL problems where \emph{batch RL is exponentially harder than online RL}, even under the most favorable batch data distribution. In other words, online exploration is critical to enable sample-efficient RL with function approximation. A second corollary is the exponential separation between finite- and infinite-horizon batch problems under our assumptions. On a technical level, this work introduces a new `oracle + batch algorithm' framework to prove lower bounds that hold for every distribution, and it automatically recovers traditional fixed-distribution lower bounds as a special case. Finally, this work helps formalize the issue known as the \emph{deadly triad} and explains that the \emph{bootstrapping} problem \citep{sutton2018reinforcement} is potentially more severe than the \emph{extrapolation} issue for RL because, unlike the latter, bootstrapping cannot be mitigated by adding more samples.
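For concreteness, the linear realizability condition invoked above can be sketched as follows; the notation ($\phi$, $\theta$, $d$) is illustrative and not necessarily that of the main text. Given a feature map $\phi:\mathcal{S}\times\mathcal{A}\to\mathbb{R}^d$, realizability of the optimal action-value function means
\[
\exists\,\theta^\star \in \mathbb{R}^d \quad \text{such that} \quad Q^\star(s,a) = \phi(s,a)^\top \theta^\star \qquad \text{for all } (s,a) \in \mathcal{S}\times\mathcal{A},
\]
while the stronger condition mentioned for non-rollout datasets requires, for \emph{every} policy $\pi$, some $\theta^\pi \in \mathbb{R}^d$ with $Q^\pi(s,a) = \phi(s,a)^\top \theta^\pi$.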