In offline reinforcement learning (RL) we have no opportunity to explore, so we must make assumptions that the data are sufficient to guide the choice of a good policy, typically taking the form of assuming some coverage, realizability, Bellman completeness, and/or hard margin (gap). In this work we propose value-based algorithms for offline RL with PAC guarantees under just partial coverage, specifically, coverage of just a single comparator policy, together with realizability of the soft (entropy-regularized) Q-function of that single policy and of a related function defined as the saddle point of a certain minimax optimization problem. This offers refined and generally more lax conditions for offline RL. We further show an analogous result for vanilla Q-functions under a soft margin condition. To attain these guarantees, we leverage novel minimax learning algorithms to accurately estimate soft or vanilla Q-functions with $L^2$-convergence guarantees. Our algorithms' loss functions arise from casting the estimation problems as nonlinear convex optimization problems and Lagrangifying.
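For concreteness, the soft (entropy-regularized) Q-function referenced above is conventionally defined through an entropy-regularized Bellman equation; a standard form for evaluating a fixed policy $\pi$ with regularization weight $\tau > 0$ (the particular weighting and sign conventions shown here are an illustrative assumption, not taken from this work) is
$$
Q^{\pi}_{\tau}(s,a) \;=\; r(s,a) \;+\; \gamma\, \mathbb{E}_{s' \sim P(\cdot \mid s,a)}\Big[\, \mathbb{E}_{a' \sim \pi(\cdot \mid s')}\big[\, Q^{\pi}_{\tau}(s',a') \;-\; \tau \log \pi(a' \mid s') \,\big] \Big],
$$
which recovers the vanilla (unregularized) Q-function in the limit $\tau \to 0$.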