We propose a novel offline reinforcement learning (RL) algorithm, Value Iteration with Perturbed Rewards (VIPeR), which combines the randomized value function idea with the pessimism principle. Most current offline RL algorithms explicitly construct statistical confidence regions to obtain pessimism via lower confidence bounds (LCB), an approach that does not scale easily to complex problems where a neural network is used to estimate the value functions. Instead, VIPeR obtains pessimism implicitly: it perturbs the offline data multiple times with carefully designed i.i.d. Gaussian noise to learn an ensemble of estimated state-action values, and then acts greedily with respect to the minimum of the ensemble. The estimated state-action values are obtained by fitting a parametric model (e.g., a neural network) to the perturbed datasets using gradient descent. As a result, VIPeR needs only $\mathcal{O}(1)$ time for action selection, while LCB-based algorithms require $\Omega(K^2)$, where $K$ is the total number of trajectories in the offline data. We also propose a novel data-splitting technique that helps remove the potentially large log covering number in the learning bound. We prove that VIPeR yields a provable uncertainty quantifier with overparameterized neural networks and achieves an $\tilde{\mathcal{O}}\left( \frac{ \kappa H^{5/2} \tilde{d} }{\sqrt{K}} \right)$ sub-optimality, where $\tilde{d}$ is the effective dimension, $H$ is the horizon length, and $\kappa$ measures the distributional shift. We corroborate the statistical and computational efficiency of VIPeR with an empirical evaluation on a wide range of synthetic and real-world datasets. To the best of our knowledge, VIPeR is the first offline RL algorithm that is both provably and computationally efficient in general Markov decision processes (MDPs) with neural network function approximation.
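The perturb-then-take-the-minimum idea can be illustrated with a minimal sketch. This is not the VIPeR algorithm itself: it assumes a single decision step ($H = 1$), swaps the neural network for a linear model on one-hot action features, and perturbs only the rewards. The function names (`fit_perturbed_model`, `build_ensemble`, `pessimistic_value`, `greedy_action`) and all hyperparameters are hypothetical choices made for illustration.

```python
import random


def fit_perturbed_model(features, rewards, sigma, rng, lr=0.5, steps=500):
    """Fit linear weights to Gaussian-perturbed rewards by gradient descent.

    Assumption: features are scaled so that lr=0.5 is stable (true for the
    one-hot features used in the demo below).
    """
    dim = len(features[0])
    n = len(features)
    # Perturb every observed reward with i.i.d. Gaussian noise.
    noisy = [r + rng.gauss(0.0, sigma) for r in rewards]
    w = [0.0] * dim
    for _ in range(steps):
        grad = [0.0] * dim
        for x, y in zip(features, noisy):
            err = sum(wi * xi for wi, xi in zip(w, x)) - y
            for j in range(dim):
                grad[j] += err * x[j] / n
        w = [wi - lr * g for wi, g in zip(w, grad)]
    return w


def build_ensemble(features, rewards, sigma, num_models, rng):
    """Train num_models models, each on an independently perturbed dataset."""
    return [fit_perturbed_model(features, rewards, sigma, rng)
            for _ in range(num_models)]


def pessimistic_value(ensemble, x):
    """Minimum predicted value over the ensemble for feature vector x."""
    return min(sum(wi * xi for wi, xi in zip(w, x)) for w in ensemble)


def greedy_action(ensemble, action_features):
    """Index of the action with the largest pessimistic (minimum) value."""
    values = [pessimistic_value(ensemble, x) for x in action_features]
    return max(range(len(values)), key=values.__getitem__)


if __name__ == "__main__":
    # Hypothetical offline data: action 0 is well covered, action 1 is not.
    feats = [[1.0, 0.0]] * 20 + [[0.0, 1.0]]
    rews = [0.5] * 20 + [0.6]
    ens = build_ensemble(feats, rews, sigma=0.5, num_models=10,
                         rng=random.Random(0))
    actions = [[1.0, 0.0], [0.0, 1.0]]
    print([pessimistic_value(ens, a) for a in actions])
    print(greedy_action(ens, actions))
```

In this sketch the noise averages out for actions with many samples but not for rarely observed ones, so the ensemble minimum tends to fall further for poorly covered actions. Action selection only evaluates the fixed ensemble, so its cost does not grow with the number of offline samples.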