Traditional reinforcement learning (RL) assumes that agents make decisions in Markov decision processes (MDPs) governed by one-step transition models. In many real-world applications, such as energy management and stock investment, agents can access multi-step predictions of future states, which offer additional advantages for decision making. However, multi-step predictions are inherently high-dimensional: naively embedding them into an MDP leads to an exponential blow-up of the state space and the curse of dimensionality. Moreover, existing RL theory provides few tools for analyzing prediction-augmented MDPs, as it typically operates on one-step transition kernels and cannot accommodate multi-step predictions that contain errors or cover only a subset of actions. We address these challenges with three key innovations. First, we propose the \emph{Bayesian value function} to characterize the optimal prediction-aware policy tractably. Second, we develop a novel \emph{Bellman-Jensen Gap} analysis of the Bayesian value function, which enables us to characterize the value of imperfect predictions. Third, we introduce BOLA (Bayesian Offline Learning with Online Adaptation), a two-stage model-based RL algorithm that separates offline Bayesian value learning from lightweight online adaptation to real-time predictions. We prove that BOLA remains sample-efficient even under imperfect predictions. We validate our theory and algorithm on synthetic MDPs and a real-world wind energy storage control problem.
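To make the two-stage structure described above concrete, the following minimal Python sketch illustrates the separation between an offline stage that learns a value table from logged data and a lightweight online stage that re-plans at the current state using a real-time prediction. This is an illustrative assumption, not the paper's BOLA algorithm: it uses a tabular model, substitutes a one-step predicted next-state distribution for the multi-step predictions discussed above, and all function names (\texttt{offline\_bayesian\_value\_learning}, \texttt{online\_adaptation}) are hypothetical.

\begin{verbatim}
import numpy as np

# Illustrative sketch of the two-stage offline/online separation
# (hypothetical names and interfaces; not the paper's BOLA algorithm).

def offline_bayesian_value_learning(dataset, n_states, n_actions,
                                    gamma=0.99, n_iters=500):
    """Stage 1 (offline): estimate a tabular model from logged
    (s, a, r, s') tuples and run value iteration on it."""
    counts = np.ones((n_states, n_actions, n_states))  # smoothed counts
    rewards = np.zeros((n_states, n_actions))
    visits = np.zeros((n_states, n_actions))
    for s, a, r, s_next in dataset:
        counts[s, a, s_next] += 1
        rewards[s, a] += r
        visits[s, a] += 1
    P = counts / counts.sum(axis=2, keepdims=True)   # (S, A, S) transitions
    R = rewards / np.maximum(visits, 1)               # (S, A) mean rewards
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iters):
        Q = R + gamma * P @ Q.max(axis=1)              # value iteration
    return P, R, Q

def online_adaptation(Q, R, state, prediction, gamma=0.99):
    """Stage 2 (online): lightweight re-planning that replaces the learned
    one-step transition at the current state with a real-time prediction,
    given as an (A, S) predicted next-state distribution per action."""
    q_adapted = R[state] + gamma * prediction @ Q.max(axis=1)
    return int(np.argmax(q_adapted))
\end{verbatim}

In this sketch the expensive computation (model estimation and value iteration) happens entirely offline, while the online step is a single one-step lookahead against the precomputed value table, mirroring the offline/online split the abstract attributes to BOLA.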