Offline reinforcement learning (RL) leverages previously collected data for policy optimization without any further active exploration. Despite recent interest in this problem, theoretical results in the neural network function approximation setting remain elusive. In this paper, we study the statistical theory of offline RL with deep ReLU network function approximation. In particular, we establish a sample complexity of $n = \tilde{\mathcal{O}}( H^{4 + 4 \frac{d}{\alpha}} \kappa_{\mu}^{1 + \frac{d}{\alpha}} \epsilon^{-2 - 2\frac{d}{\alpha}} )$ for offline RL with deep ReLU networks, where $\kappa_{\mu}$ is a measure of distributional shift, $H = (1-\gamma)^{-1}$ is the effective horizon length, $d$ is the dimension of the state-action space, $\alpha$ is a (possibly fractional) smoothness parameter of the underlying Markov decision process (MDP), and $\epsilon$ is a user-specified error. Notably, our sample complexity holds under two novel considerations: the Besov dynamic closure and the correlated structure. While the Besov dynamic closure subsumes the dynamic conditions for offline RL in prior works, the correlated structure renders prior analyses of offline RL with general/neural network function approximation improper or inefficient in long (effective) horizon problems. To the best of our knowledge, this is the first theoretical characterization of the sample complexity of offline RL with deep neural network function approximation under the general Besov regularity condition, which goes beyond the linearity regime of traditional reproducing kernel Hilbert spaces and Neural Tangent Kernels.