Several practical applications of reinforcement learning involve an agent learning from past data without the possibility of further exploration. Often these applications require us to either 1) identify a near-optimal policy or 2) estimate the value of a target policy. For both tasks we derive \emph{exponential} information-theoretic lower bounds in discounted infinite-horizon MDPs with a linear function representation for the action-value function, even if 1) \emph{realizability} holds, 2) the batch algorithm observes the exact reward and transition \emph{functions}, and 3) the batch algorithm is given the \emph{best} a priori data distribution for the problem class. Our work introduces a new `oracle + batch algorithm' framework to prove lower bounds that hold for every distribution, and it establishes an exponential separation between batch and online reinforcement learning.
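For concreteness, the \emph{realizability} condition referenced above can be stated in its standard linear form (the notation $\phi$, $\theta_\pi$, $d$ is ours and is not fixed by this abstract): the action-value function of every policy of interest is exactly linear in a known $d$-dimensional feature map,
\begin{equation*}
    Q^{\pi}(s,a) \;=\; \phi(s,a)^{\top}\theta_{\pi}
    \qquad \text{for some } \theta_{\pi} \in \mathbb{R}^{d} \text{ and all } (s,a),
\end{equation*}
so the exponential lower bounds apply even when the representation incurs no approximation error.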