We investigate the fundamental problem of leveraging offline data to accelerate online reinforcement learning, a direction with strong potential but limited theoretical grounding. Our study centers on how to learn and apply value envelopes in this setting. To this end, we introduce a principled two-stage framework: the first stage uses offline data to derive upper and lower bounds on value functions, while the second incorporates these learned bounds into online algorithms. Our method extends prior work by decoupling the upper and lower bounds, enabling more flexible and tighter approximations. In contrast to approaches that rely on fixed shaping functions, our envelopes are data-driven and explicitly modeled as random variables, with a filtration argument ensuring independence between the two stages. The analysis establishes high-probability regret bounds governed by two interpretable quantities, thereby providing a formal bridge between offline pre-training and online fine-tuning. Empirical results on tabular MDPs demonstrate substantial regret reductions compared with both UCBVI and prior methods.
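To make the two-stage idea concrete in the tabular finite-horizon setting, the sketch below is a minimal illustration under stated assumptions, not the paper's algorithm: `estimate_envelopes` builds upper and lower value envelopes from offline counts by backward value iteration with a count-based confidence width, and `clip_to_envelope` shows one way an online method in the UCBVI family could project its optimistic values into the learned envelope. The function names, the bonus constant `c`, and the specific clipping rule are illustrative assumptions.

```python
# Minimal sketch (not the authors' exact method) of the two-stage framework on a
# finite-horizon tabular MDP with S states, A actions, horizon H, rewards in [0, 1].
import numpy as np


def estimate_envelopes(counts_sa, counts_sas, rewards_sum, H, c=1.0):
    """Stage 1 (offline): derive upper/lower envelopes U_h(s), L_h(s) on V*_h(s).

    counts_sa[s, a]      -- number of offline visits to (s, a)
    counts_sas[s, a, s2] -- number of offline transitions (s, a) -> s2
    rewards_sum[s, a]    -- sum of observed rewards at (s, a)
    c                    -- illustrative confidence-width constant (assumption)
    """
    S, A = counts_sa.shape
    n = np.maximum(counts_sa, 1.0)                 # avoid division by zero
    P_hat = counts_sas / n[:, :, None]             # empirical transition model
    r_hat = rewards_sum / n                        # empirical mean reward
    bonus = c * np.sqrt(1.0 / n)                   # count-based confidence width
    U = np.zeros((H + 1, S))                       # optimistic envelope
    L = np.zeros((H + 1, S))                       # pessimistic envelope
    for h in range(H - 1, -1, -1):                 # backward value iteration
        q_up = np.clip(r_hat + bonus + P_hat @ U[h + 1], 0.0, H - h)
        q_lo = np.clip(r_hat - bonus + P_hat @ L[h + 1], 0.0, H - h)
        U[h] = q_up.max(axis=1)
        L[h] = q_lo.max(axis=1)
    return U, L


def clip_to_envelope(V_optimistic, U, L):
    """Stage 2 hook (online): project optimistic value estimates into [L, U]."""
    return np.minimum(np.maximum(V_optimistic, L), U)


if __name__ == "__main__":
    # Synthetic offline statistics, purely to show the functions run end to end.
    rng = np.random.default_rng(0)
    S, A, H = 5, 2, 10
    counts_sa = rng.integers(1, 50, size=(S, A)).astype(float)
    counts_sas = rng.random((S, A, S))
    counts_sas = counts_sas / counts_sas.sum(axis=2, keepdims=True) * counts_sa[:, :, None]
    rewards_sum = rng.random((S, A)) * counts_sa
    U, L = estimate_envelopes(counts_sa, counts_sas, rewards_sum, H)
    assert np.all(L <= U)  # the lower envelope never exceeds the upper one
```

One plausible way to exploit the decoupled bounds, consistent with the abstract, is to clip optimistic online values to U while using L to certify near-optimal actions or tighten bonuses; the precise mechanism and its regret analysis are specified in the paper itself.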