We study the statistical theory of offline reinforcement learning (RL) with deep ReLU network function approximation. We analyze a variant of the fitted-Q iteration (FQI) algorithm under a new dynamic condition that we call Besov dynamic closure, which encompasses the conditions used in prior analyses of deep neural network function approximation. Under Besov dynamic closure, we prove that the FQI-type algorithm achieves a sample complexity of $\tilde{\mathcal{O}}\left( \kappa^{1 + d/\alpha} \cdot \epsilon^{-2 - 2d/\alpha} \right)$, where $\kappa$ is a distribution shift measure, $d$ is the dimensionality of the state-action space, $\alpha$ is the (possibly fractional) smoothness parameter of the underlying MDP, and $\epsilon$ is a user-specified precision. This improves on the sample complexity of $\tilde{\mathcal{O}}\left( K \cdot \kappa^{2 + d/\alpha} \cdot \epsilon^{-2 - d/\alpha} \right)$ in the prior result [Yang et al., 2019], where $K$ is the number of algorithmic iterations, which is arbitrarily large in practice. Importantly, our sample complexity is obtained under the new general dynamic condition and a data-dependent structure, the latter of which is either ignored by prior algorithms or improperly handled by prior analyses. This is the first comprehensive analysis for offline RL with deep ReLU network function approximation under a general setting.
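As a rough check on how the two rates compare, one can divide the prior bound by the new one. This is only a sketch: it takes both bounds at face value and ignores the constants and logarithmic factors hidden by $\tilde{\mathcal{O}}$.

$$
\frac{K \cdot \kappa^{2 + d/\alpha} \cdot \epsilon^{-2 - d/\alpha}}{\kappa^{1 + d/\alpha} \cdot \epsilon^{-2 - 2d/\alpha}} = K \cdot \kappa \cdot \epsilon^{d/\alpha}
$$

Under these simplifications, the new bound is the smaller of the two whenever $K \cdot \kappa \cdot \epsilon^{d/\alpha} > 1$, that is, once the iteration number satisfies $K > \kappa^{-1} \cdot \epsilon^{-d/\alpha}$.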