In offline reinforcement learning (RL), a learner leverages prior logged data to learn a good policy without interacting with the environment. A major challenge in applying such methods in practice is the lack of both theoretically principled and practical tools for model selection and evaluation. To address this, we study the problem of model selection in offline RL with value function approximation. The learner is given a nested sequence of model classes to minimize squared Bellman error and must select among them to balance the classes' approximation and estimation errors. We propose the first model selection algorithm for offline RL that achieves minimax rate-optimal oracle inequalities up to logarithmic factors. The algorithm, ModBE, takes as input a collection of candidate model classes and a generic base offline RL algorithm. By successively eliminating model classes using a novel one-sided generalization test, ModBE returns a policy with regret scaling with the complexity of the minimally complete model class. In addition to its theoretical guarantees, it is conceptually simple and computationally efficient, amounting to solving a series of square-loss regression problems and then comparing relative square loss between classes. We conclude with several numerical simulations showing that it reliably selects a good model class.
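To make the selection loop concrete, the following is a minimal sketch of the elimination-by-comparison structure described above: each nested class is fit by square-loss regression, and a simpler class is discarded only if some richer class improves the held-out squared loss by more than an allotted slack. The nested classes here are linear predictors on growing feature prefixes, and the function names, the slack schedule, and the toy data are illustrative assumptions, not ModBE's exact test or its Bellman-error objective.

```python
import numpy as np

# Sketch: one-sided comparison of relative squared loss between nested classes.
# The slack rule and interfaces below are placeholders, not the paper's procedure.

def fit_least_squares(X, y, dim):
    """Square-loss regression restricted to the first `dim` features."""
    beta, *_ = np.linalg.lstsq(X[:, :dim], y, rcond=None)
    return beta

def holdout_sq_loss(beta, X, y, dim):
    """Held-out squared loss of the restricted predictor."""
    return np.mean((X[:, :dim] @ beta - y) ** 2)

def select_class(X_tr, y_tr, X_val, y_val, dims, slack):
    """Return the index of the simplest class that survives the one-sided
    comparison against every richer class."""
    fits = [fit_least_squares(X_tr, y_tr, d) for d in dims]
    losses = [holdout_sq_loss(b, X_val, y_val, d) for b, d in zip(fits, dims)]
    for m in range(len(dims)):
        # Eliminate class m only if a richer class beats it by more than slack[m].
        if not any(losses[m] - losses[k] > slack[m] for k in range(m + 1, len(dims))):
            return m
    return len(dims) - 1

# Toy usage: the true signal uses only the first 3 of 10 features, so the
# procedure should settle on a small class even though larger ones are available.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 10))
y = X[:, :3] @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=600)
dims = [1, 2, 3, 5, 10]
print(select_class(X[:400], y[:400], X[400:], y[400:], dims,
                   slack=[0.05 * d for d in dims]))
```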