Two central paradigms have emerged in the reinforcement learning (RL) community: online RL and offline RL. In the online RL setting, the agent has no prior knowledge of the environment and must interact with it in order to find an $\epsilon$-optimal policy. In the offline RL setting, the learner instead has access to a fixed dataset to learn from, but is unable to otherwise interact with the environment, and must obtain the best policy it can from this offline data. Practical scenarios often motivate an intermediate setting: if we have some set of offline data and, in addition, may also interact with the environment, how can we best use the offline data to minimize the number of online interactions necessary to learn an $\epsilon$-optimal policy? In this work, we consider this setting, which we call the \textsf{FineTuneRL} setting, for MDPs with linear structure. We characterize the number of online samples necessary in this setting given access to some offline dataset, and develop an algorithm, \textsc{FTPedel}, which is provably optimal. We show through an explicit example that combining offline data with online interactions can lead to a provable improvement over either purely offline or purely online RL. Finally, our results illustrate the distinction between \emph{verifiable} learning, the typical setting considered in online RL, and \emph{unverifiable} learning, the setting often considered in offline RL, and show that there is a formal separation between these regimes.