On Bellman's principle of optimality and Reinforcement learning for safety-constrained Markov decision process

We study optimality for the safety-constrained Markov decision process, which is the underlying framework for safe reinforcement learning. Specifically, we consider a constrained Markov decision process (with finite states and finite actions) in which the decision maker's goal is to reach a target set while avoiding one or more unsafe sets with certain probabilistic guarantees. Because a target set and an unsafe set exist by definition, the Markov chain induced by any control policy is multichain. The decision maker must also be optimal (with respect to a cost function) while navigating to the target set, which gives rise to a multi-objective optimization problem. We highlight that Bellman's principle of optimality may not hold for constrained Markov decision problems with an underlying multichain structure, as we show with a counterexample. We resolve the counterexample by formulating the multi-objective optimization problem as a zero-sum game, and we then construct an asynchronous value iteration scheme for the Lagrangian (similar to Shapley's algorithm). Finally, we consider the reinforcement learning problem for the same setting and construct a modified Q-learning algorithm for learning the Lagrangian from data. We also provide a lower bound on the number of iterations required for learning the Lagrangian, together with the corresponding error bounds.
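To make the Lagrangian idea concrete, the following is a minimal sketch of the minimizing player's problem for a fixed multiplier. It is not the algorithm of the paper: the toy three-state MDP, the function name `penalized_value_iteration`, and the choice to charge the multiplier `lam` on every transition into the unsafe set are all illustrative assumptions. The in-place sweep is one simple form of asynchronous update; the maximizing player's choice of `lam` is left out.

```python
# Hypothetical toy MDP: state 0 is the start, state 1 the target, state 2 unsafe.
# P[s][a] maps next states to probabilities; target and unsafe are absorbing.
TARGET, UNSAFE = 1, 2
P = [
    [{0: 0.5, 1: 0.4, 2: 0.1},  # action 0: cheap but may hit the unsafe state
     {1: 1.0}],                 # action 1: expensive but always safe
    [],                         # target: no actions
    [],                         # unsafe: no actions
]
cost = [[1.0, 3.0], [], []]


def penalized_value_iteration(P, cost, absorbing, unsafe, lam, iters=200):
    """Minimize expected total cost + lam * P(reach unsafe) for a fixed lam."""
    n = len(P)
    V = [0.0] * n
    policy = [0] * n
    for _ in range(iters):
        for s in range(n):
            if s in absorbing:
                continue  # terminal states keep value 0
            q = []
            for a in range(len(P[s])):
                total = cost[s][a]
                for s2, p in P[s][a].items():
                    # lam is charged when the transition enters the unsafe set
                    penalty = lam if s2 in unsafe else 0.0
                    total += p * (penalty + V[s2])
                q.append(total)
            policy[s] = min(range(len(q)), key=q.__getitem__)
            V[s] = q[policy[s]]  # in-place update, reused within the same sweep
    return V, policy


V_free, pi_free = penalized_value_iteration(P, cost, {TARGET, UNSAFE}, {UNSAFE}, lam=0.0)
V_safe, pi_safe = penalized_value_iteration(P, cost, {TARGET, UNSAFE}, {UNSAFE}, lam=30.0)
```

With `lam=0.0` the greedy policy ignores safety and takes the cheap action; with a large enough `lam` it switches to the safe action. In a game formulation, the multiplier would be chosen by an opposing player rather than fixed by hand as it is here.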
