Several practical applications of reinforcement learning involve an agent learning from past data without the possibility of further exploration. Often these applications require us to 1) identify a near-optimal policy or to 2) estimate the value of a target policy. For both tasks we derive \emph{exponential} information-theoretic lower bounds in discounted infinite-horizon MDPs with a linear function representation for the action-value function, even if 1) \emph{realizability} holds, 2) the batch algorithm observes the exact reward and transition \emph{functions}, and 3) the batch algorithm is given the \emph{best} a priori data distribution for the problem class. Furthermore, if the dataset does not come from policy rollouts, then the lower bounds hold even if the action-value function of \emph{every} policy admits a linear representation. If the objective is to find a near-optimal policy, we discover that these hard instances are easily solved by an \emph{online} algorithm, showing that there exist RL problems where \emph{batch RL is exponentially harder than online RL}, even under the most favorable batch data distribution. In other words, online exploration is critical to enable sample-efficient RL with function approximation. A second corollary is the exponential separation between finite- and infinite-horizon batch problems under our assumptions. On a technical level, this work introduces a new `oracle + batch algorithm' framework to prove lower bounds that hold for every distribution, and it automatically recovers traditional fixed-distribution lower bounds as a special case. Finally, this work helps formalize the issue known as the \emph{deadly triad} and explains that the \emph{bootstrapping} problem \citep{sutton2018reinforcement} is potentially more severe than the \emph{extrapolation} issue for RL because, unlike the latter, bootstrapping cannot be mitigated by adding more samples.
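For concreteness, the linear realizability condition invoked above can be sketched as follows; the notation ($\phi$, $\theta$, $d$) is illustrative and not necessarily that of the main text. Given a feature map $\phi:\mathcal{S}\times\mathcal{A}\to\mathbb{R}^d$, realizability of the optimal action-value function means
\[
\exists\,\theta^\star \in \mathbb{R}^d \quad \text{such that} \quad Q^\star(s,a) = \phi(s,a)^\top \theta^\star \qquad \text{for all } (s,a) \in \mathcal{S}\times\mathcal{A},
\]
while the stronger condition mentioned for non-rollout datasets requires, for \emph{every} policy $\pi$, some $\theta^\pi \in \mathbb{R}^d$ with $Q^\pi(s,a) = \phi(s,a)^\top \theta^\pi$.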