We consider the challenge of finding a deterministic policy for a Markov decision process that uniformly (in all states) maximizes one reward subject to a probabilistic constraint over a different reward. Existing solutions do not fully address our precise problem definition, which nevertheless arises naturally in the context of safety-critical robotic systems. This class of problem is known to be hard; moreover, the combined requirements of determinism and uniform optimality can create learning instability. In this work, after describing and motivating our problem with a simple example, we present a suitable constrained reinforcement learning algorithm that prevents learning instability, using recursive constraints. Our proposed approach admits an approximate form that improves efficiency and is conservative with respect to the constraint.
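To make the problem definition concrete, the following sketch states it in standard constrained-MDP notation; the symbols used here (task reward $r$, constraint reward $c$, discount factor $\gamma$, threshold $d$, probability level $p$) are illustrative assumptions rather than the paper's exact formulation.

% Illustrative formalization (notation assumed): find a deterministic policy
% that, from every state, maximizes the expected return of one reward while a
% probabilistic constraint over a second reward is satisfied.
\begin{align*}
  \max_{\pi \in \Pi_{\mathrm{det}}} \;& \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\, r(S_t, A_t) \,\middle|\, S_0 = s\right] && \text{for every state } s,\\
  \text{s.t.}\;& \mathbb{P}_{\pi}\!\left(\sum_{t=0}^{\infty} c(S_t, A_t) \ge d \,\middle|\, S_0 = s\right) \ge p && \text{for every state } s.
\end{align*}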