We revisit offline reinforcement learning on episodic time-homogeneous Markov Decision Processes (MDPs). For a tabular MDP with $S$ states and $A$ actions, or a linear MDP with anchor points and feature dimension $d$, given $K$ collected episodes and a minimum visiting probability $d_m$ over (anchor) state-action pairs, we obtain nearly horizon-$H$-free sample complexity bounds for offline reinforcement learning when the total reward is upper bounded by $1$. Specifically: 1. For offline policy evaluation, we obtain an $\tilde{O}\left(\sqrt{\frac{1}{Kd_m}} \right)$ error bound for the plug-in estimator, which matches the lower bound up to logarithmic factors and carries no additional $\mathrm{poly}\left(H, S, A, d\right)$ dependency in the higher-order term. 2. For offline policy optimization, we obtain an $\tilde{O}\left(\sqrt{\frac{1}{Kd_m}} + \frac{\min(S, d)}{Kd_m}\right)$ sub-optimality gap for the empirical optimal policy, which approaches the lower bound up to logarithmic factors and a higher-order term, improving upon the best known result of \cite{cui2020plug}, which has additional $\mathrm{poly}\left(H, S, d\right)$ factors in the main term. To the best of our knowledge, these are the \emph{first} nearly horizon-free bounds for episodic time-homogeneous offline tabular MDPs and linear MDPs with anchor points. Central to our analysis is a simple yet effective recursion-based method for bounding a ``total variance'' term in the offline setting, which may be of independent interest.
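To see how the two terms in the policy-optimization bound compare (a quick consequence of the stated rate, not an additional result), note that the higher-order term is dominated by the statistical term once $Kd_m$ exceeds $\min(S, d)^2$:
\[
\frac{\min(S, d)}{K d_m} \le \sqrt{\frac{1}{K d_m}}
\quad\Longleftrightarrow\quad
K d_m \ge \min(S, d)^2,
\]
so in this regime the sub-optimality gap is $\tilde{O}\left(\sqrt{\frac{1}{Kd_m}}\right)$, matching the policy-evaluation rate up to logarithmic factors.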