Offline reinforcement learning, which aims to optimize sequential decision-making strategies from historical data, has been extensively applied in real-life applications. State-of-the-art algorithms usually leverage powerful function approximators (e.g., neural networks) to alleviate the sample complexity hurdle and achieve better empirical performance. Despite these successes, a more systematic understanding of the statistical complexity of function approximation remains lacking. To bridge this gap, we take a step by considering offline reinforcement learning with differentiable function class approximation (DFA). This function class naturally incorporates a wide range of models with nonlinear/nonconvex structures. Most importantly, we show that offline RL with differentiable function approximation is provably efficient by analyzing the pessimistic fitted Q-learning (PFQL) algorithm, and our results provide a theoretical basis for understanding a variety of practical heuristics that rely on Fitted Q-Iteration-style design. In addition, we further improve our guarantee with a tighter instance-dependent characterization. We hope our work will draw interest in studying reinforcement learning with differentiable function approximation beyond the scope of current research.
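To make the Fitted Q-Iteration-style design concrete, the following is a minimal, hedged sketch of pessimistic fitted Q-iteration on a toy offline dataset. It is an illustration only, not the paper's PFQL algorithm or its theoretical construction: the linear-in-features model, the feature map `features`, the penalty weights `lam` and `beta`, and the covariance-based pessimism bonus are all assumptions made for this example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy offline dataset of (state, action, reward, next_state) tuples:
# 2-dimensional continuous states, 2 discrete actions (all synthetic).
N, d, num_actions, gamma = 500, 2, 2, 0.95
states = rng.normal(size=(N, d))
actions = rng.integers(num_actions, size=N)
rewards = rng.normal(size=N)
next_states = states + 0.1 * rng.normal(size=(N, d))

def features(s, a):
    """Differentiable feature map phi(s, a): state stacked with a one-hot action."""
    return np.concatenate([s, np.eye(num_actions)[a]], axis=-1)

dim = d + num_actions
Phi_data = features(states, actions)

# Empirical feature covariance used for a pessimism penalty:
# uncertainty is larger along directions the dataset covers poorly.
lam, beta = 1.0, 0.1  # regularization / penalty weights (illustrative choices)
Sigma_inv = np.linalg.inv(Phi_data.T @ Phi_data + lam * np.eye(dim))

def penalty(phi):
    """Bonus-style penalty beta * sqrt(phi^T Sigma^{-1} phi), subtracted for pessimism."""
    return beta * np.sqrt(np.einsum("nd,de,ne->n", phi, Sigma_inv, phi))

theta = np.zeros(dim)
for _ in range(50):  # fitted Q-iteration loop
    # Pessimistic regression targets: r + gamma * max_a' [ Q_theta(s', a') - penalty(s', a') ].
    q_next = []
    for a in range(num_actions):
        phi_next = features(next_states, np.full(N, a))
        q_next.append(phi_next @ theta - penalty(phi_next))
    targets = rewards + gamma * np.max(np.stack(q_next, axis=1), axis=1)

    # Fit the differentiable model to the Bellman targets by least squares;
    # for a nonlinear Q_theta one would run gradient descent on the squared loss instead.
    theta = np.linalg.lstsq(Phi_data, targets, rcond=None)[0]

print("fitted parameters:", theta)
```

The sketch only conveys the two ingredients named in the abstract, an iterative regression onto Bellman targets with a differentiable function class and a pessimistic adjustment for poorly covered state-action pairs; the paper's actual algorithm and analysis should be consulted for the precise construction.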