Algorithms developed under stationary Markov Decision Processes (MDPs) often face challenges in non-stationary environments, and infinite-horizon formulations may not directly apply to finite-horizon tasks. To address these limitations, we introduce the Non-stationary and Varying-discounting MDP (NVMDP) framework, which naturally accommodates non-stationarity and allows discount rates to vary with time and with transitions. With respect to identifying an optimal policy, infinite-horizon, stationary MDPs emerge as special cases of NVMDPs, and finite-horizon MDPs are likewise subsumed by the NVMDP formulation. Moreover, NVMDPs provide a flexible mechanism for shaping optimal policies without altering the state space, action space, or reward structure. We establish the theoretical foundations of NVMDPs, including assumptions, state- and action-value formulation and recursion, matrix representation, optimality conditions, and policy improvement under finite state and action spaces. Building on these results, we adapt dynamic programming and generalized Q-learning algorithms to NVMDPs and provide formal convergence proofs. For problems requiring function approximation, we extend the Policy Gradient Theorem and the policy improvement bound of Trust Region Policy Optimization (TRPO), offering proofs in both scalar and matrix forms. Empirical evaluations in a non-stationary gridworld environment demonstrate that NVMDP-based algorithms successfully recover optimal trajectories under multiple reward and discounting schemes, whereas standard Q-learning fails. These results collectively show that NVMDPs provide a theoretically sound and practically effective framework for reinforcement learning, requiring only minor algorithmic modifications while enabling robust handling of non-stationarity and explicit shaping of optimal policies.
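To make the varying-discounting idea concrete, the sketch below shows a tabular Q-learning update in which the discount depends on the time step and the transition, which is the kind of generalized update an NVMDP admits. This is a minimal illustrative sketch, not the paper's implementation: the environment interface `env.step(a, t)`, the discount function `gamma(t, s, a, s_next)`, and all hyperparameters are assumptions made for exposition.

```python
# Minimal sketch (illustrative assumptions, not the paper's implementation):
# tabular Q-learning with a time- and transition-dependent discount.
import numpy as np

def nvmdp_q_learning(env, gamma, n_states, n_actions,
                     episodes=500, alpha=0.1, epsilon=0.1, horizon=100):
    rng = np.random.default_rng(0)
    # One Q-table per time step, since values may depend on t in a
    # non-stationary, finite-horizon setting.
    Q = np.zeros((horizon + 1, n_states, n_actions))
    for _ in range(episodes):
        s = env.reset()
        for t in range(horizon):
            # Epsilon-greedy action selection on the time-indexed table.
            if rng.random() < epsilon:
                a = int(rng.integers(n_actions))
            else:
                a = int(np.argmax(Q[t, s]))
            s_next, r, done = env.step(a, t)  # assumed time-aware step
            # Varying discount: gamma depends on time and on the transition.
            g = gamma(t, s, a, s_next)
            target = r + (0.0 if done else g * np.max(Q[t + 1, s_next]))
            Q[t, s, a] += alpha * (target - Q[t, s, a])
            s = s_next
            if done:
                break
    return Q
```

Setting `gamma` to a constant function recovers the usual stationary discounted update, which is consistent with stationary MDPs appearing as a special case of the framework.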