In this paper, we consider solving discounted Markov Decision Processes (MDPs) under the constraint that the resulting policy is stabilizing. In practice, MDPs are solved based on some form of policy approximation. We leverage recent results proposing the use of Model Predictive Control (MPC) as a structured policy in the context of Reinforcement Learning, which makes it possible to introduce stability requirements directly inside the MPC-based policy. This restricts the solution of the MDP to stabilizing policies by construction. The stability theory for MPC is most mature in the undiscounted case. Hence, we first show that stable discounted MDPs can be reformulated as undiscounted ones. This observation entails that the MPC-based policy with stability requirements produces the optimal policy for the discounted MDP if that policy is stabilizing, and the best stabilizing policy otherwise.
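To make the distinction between the two problem classes concrete, the following is a minimal sketch of the discounted and undiscounted objectives; the stage cost $\ell$, discount factor $\gamma\in(0,1)$, states $s_k$, and policy $\pi$ are generic symbols introduced here for illustration and are not taken from the paper's own notation.
\[
V_\gamma^{\pi}(s_0) \,=\, \mathbb{E}\!\left[\sum_{k=0}^{\infty} \gamma^{k}\, \ell\big(s_k,\pi(s_k)\big)\right],
\qquad
V^{\pi}(s_0) \,=\, \mathbb{E}\!\left[\sum_{k=0}^{\infty} \ell\big(s_k,\pi(s_k)\big)\right].
\]
The undiscounted sum is typically well defined only for policies that drive the stage cost to zero, i.e., stabilizing policies; the reformulation claimed above recasts the discounted problem in this undiscounted form, for which MPC stability theory is most mature.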