We consider the problem of offline reinforcement learning (RL) -- a well-motivated setting of RL that aims at policy optimization using only historical data. Despite its wide applicability, the theoretical understanding of offline RL, such as its optimal sample complexity, remains largely open even in basic settings such as \emph{tabular} Markov Decision Processes (MDPs). In this paper, we propose Off-Policy Double Variance Reduction (OPDVR), a new variance-reduction-based algorithm for offline RL. Our main result shows that OPDVR provably identifies an $\epsilon$-optimal policy with $\widetilde{O}(H^2/d_m\epsilon^2)$ episodes of offline data in the finite-horizon stationary-transition setting, where $H$ is the horizon length and $d_m$ is the minimal marginal state-action distribution induced by the behavior policy. This improves over the best known upper bound by a factor of $H$. Moreover, we establish an information-theoretic lower bound of $\Omega(H^2/d_m\epsilon^2)$, certifying that OPDVR is optimal up to logarithmic factors. Finally, we show that OPDVR also achieves rate-optimal sample complexity under alternative settings, such as finite-horizon MDPs with non-stationary transitions and infinite-horizon MDPs with discounted rewards.