Faster Approximate Dynamic Programming by Freezing Slow States

We consider infinite-horizon Markov decision processes (MDPs) with fast-slow structure, meaning that certain parts of the state space move "fast" (and in a sense, are more influential) while other parts transition more "slowly." Such structure is common in real-world problems where sequential decisions need to be made at high frequencies, yet information that varies at a slower timescale also influences the optimal policy. Examples include: (1) service allocation for a multi-class queue with (slowly varying) stochastic costs, (2) a restless multi-armed bandit with an environmental state, and (3) energy demand response, where both day-ahead and real-time prices play a role in the firm's revenue. Models that fully capture these problems often result in MDPs with large state spaces and large effective time horizons (due to frequent decisions), rendering them computationally intractable. We propose an approximate dynamic programming algorithmic framework based on the idea of "freezing" the slow states, solving a set of simpler finite-horizon MDPs (the lower-level MDPs), and applying value iteration (VI) to an auxiliary MDP that transitions on a slower timescale (the upper-level MDP). We also extend the technique to a function approximation setting, where a feature-based linear architecture is used. On the theoretical side, we analyze the regret incurred by each variant of our frozen-state approach. Finally, we give empirical evidence that the frozen-state approach generates effective policies using just a fraction of the computational cost, while illustrating that simply omitting slow states from the decision modeling is often not a viable heuristic.
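To make the frozen-state idea concrete, the following is a minimal tabular sketch in Python. It is one simplified reading of the abstract, not the paper's exact algorithm. The function name `frozen_state_vi`, the array layout, and the modeling assumptions are all hypothetical: the slow state is assumed to transition only once every `T` periods and independently of the fast state and action, and the fast transition kernel may depend on the (frozen) slow state. Each outer iteration is one VI step on the upper-level MDP; the inner loop is backward induction on the lower-level, horizon-`T` MDPs with the slow state held fixed.

```python
import numpy as np

def frozen_state_vi(r, P_fast, P_slow, gamma, T, tol=1e-8, max_iter=10_000):
    """Illustrative frozen-state value iteration (hypothetical helper).

    r:      rewards, shape (S, F, A) for slow state s, fast state f, action a
    P_fast: fast-state kernel, shape (S, F, A, F); last axis sums to 1
    P_slow: slow-state kernel, shape (S, S); rows sum to 1 (assumption:
            the slow state moves once per T periods, independent of f and a)
    gamma:  discount factor in (0, 1)
    T:      number of periods the slow state is frozen (T >= 1)
    Returns (V, policy): V has shape (S, F); policy is the greedy action at
    the first period of each frozen block, shape (S, F).
    """
    S, F, A = r.shape
    V = np.zeros((S, F))
    policy = np.zeros((S, F), dtype=int)
    for _ in range(max_iter):
        # Upper level: expected continuation value after the slow state moves.
        W = P_slow @ V  # shape (S, F)
        # Lower level: T steps of backward induction with s frozen.
        for _step in range(T):
            Q = r + gamma * np.einsum("sfag,sg->sfa", P_fast, W)
            W = Q.max(axis=2)
        policy = Q.argmax(axis=2)
        converged = np.max(np.abs(W - V)) < tol
        V = W
        if converged:
            break
    return V, policy
```

Under these assumptions, each outer iteration contracts with modulus `gamma ** T` rather than `gamma`, which is one way to see why fewer upper-level sweeps may be needed when decisions are frequent. With `T = 1` the sketch reduces to ordinary value iteration on the product state space.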
