Policy Finetuning: Bridging Sample-Efficient Offline and Online Reinforcement Learning

Recent theoretical work studies sample-efficient reinforcement learning (RL) extensively in two settings: learning interactively in the environment (online RL), or learning from an offline dataset (offline RL). However, existing algorithms and theories for learning near-optimal policies in these two settings are rather different and disconnected. Towards bridging this gap, this paper initiates the theoretical study of policy finetuning, that is, online RL where the learner has additional access to a "reference policy" $\mu$ close to the optimal policy $\pi_\star$ in a certain sense. We consider the policy finetuning problem in episodic Markov Decision Processes (MDPs) with $S$ states, $A$ actions, and horizon length $H$. We first design a sharp offline reduction algorithm -- which simply executes $\mu$ and runs offline policy optimization on the collected dataset -- that finds an $\varepsilon$ near-optimal policy within $\widetilde{O}(H^3SC^\star/\varepsilon^2)$ episodes, where $C^\star$ is the single-policy concentrability coefficient between $\mu$ and $\pi_\star$. This offline result is the first that matches the sample complexity lower bound in this setting, and resolves a recent open question in offline RL. We then establish an $\Omega(H^3S\min\{C^\star, A\}/\varepsilon^2)$ sample complexity lower bound for any policy finetuning algorithm, including those that can adaptively explore the environment. This implies that -- perhaps surprisingly -- the optimal policy finetuning algorithm is either offline reduction or a purely online RL algorithm that does not use $\mu$. Finally, we design a new hybrid offline/online algorithm for policy finetuning that achieves better sample complexity than both vanilla offline reduction and purely online RL algorithms, in a relaxed setting where $\mu$ only satisfies concentrability partially up to a certain time step.
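The comparison between the two bounds above can be made concrete with a small calculation. The sketch below is illustrative only: it evaluates the stated rates with constants and logarithmic factors dropped, and it assumes that a purely online algorithm attains the order $H^3SA/\varepsilon^2$ suggested by the lower bound when $C^\star \ge A$. The function names are hypothetical and do not come from the paper.

```python
def offline_reduction_rate(H, S, C_star, eps):
    """Order of the offline reduction upper bound, H^3 * S * C_star / eps^2.

    Constants and log factors are ignored (assumption for illustration).
    """
    return H**3 * S * C_star / eps**2


def lower_bound_rate(H, S, A, C_star, eps):
    """Order of the lower bound, H^3 * S * min(C_star, A) / eps^2."""
    return H**3 * S * min(C_star, A) / eps**2


def preferred_strategy(A, C_star):
    """Pick the strategy whose rate matches the lower bound.

    If C_star <= A, the offline reduction rate equals the lower bound;
    otherwise the min is attained by A, which corresponds to an online
    method that ignores the reference policy (assumed rate H^3 * S * A / eps^2).
    """
    return "offline reduction" if C_star <= A else "purely online RL"


if __name__ == "__main__":
    H, S, A, eps = 2, 3, 2, 0.5
    for C_star in (1, 4):
        print(
            C_star,
            offline_reduction_rate(H, S, C_star, eps),
            lower_bound_rate(H, S, A, C_star, eps),
            preferred_strategy(A, C_star),
        )
```

When $C^\star \le A$ the two rates coincide, which is the sense in which offline reduction is already optimal in that regime. When $C^\star > A$ the reference policy is too far from $\pi_\star$ to help, and the lower bound no longer depends on $C^\star$.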
