In safe MDP planning, a cost function based on the current state and action is often used to specify safety aspects. In the real world, however, the state representation used often lacks sufficient fidelity to specify such safety constraints, and operating with an incomplete model can produce unintended negative side effects (NSEs). To address these challenges, we first associate safety signals with state-action trajectories (rather than just an immediate state-action pair), which makes our safety model highly general. We also assume that categorical safety labels are given for different trajectories, rather than a numerical cost function, which is harder for a problem designer to specify; we then employ a supervised learning model to learn such non-Markovian safety patterns. Second, we develop a Lagrange multiplier method that incorporates the safety model and the underlying MDP model into a single computation graph, enabling the agent to learn safe behaviors. Finally, our empirical results on a variety of discrete and continuous domains show that this approach can satisfy complex non-Markovian safety constraints while optimizing an agent's total returns, is highly scalable, and outperforms the previous best approach for Markovian NSEs.
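The following is a minimal sketch of the two components described above, assuming PyTorch; the GRU-based classifier, the tensor dimensions, the violation budget, and the multiplier-update scheme are illustrative assumptions rather than the exact architecture or hyperparameters used in the paper.

```python
# Illustrative sketch (assumptions: PyTorch, hypothetical dimensions).
# (1) A sequence model that learns non-Markovian safety labels from
#     state-action trajectories.
# (2) A Lagrangian relaxation that couples the learned safety signal with
#     the policy objective in a single computation graph.
import torch
import torch.nn as nn

class TrajectorySafetyClassifier(nn.Module):
    """Predicts a categorical safety label for a whole state-action trajectory."""
    def __init__(self, obs_dim, act_dim, hidden=64, n_labels=2):
        super().__init__()
        self.rnn = nn.GRU(obs_dim + act_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, n_labels)

    def forward(self, traj):            # traj: [batch, T, obs_dim + act_dim]
        _, h = self.rnn(traj)           # final hidden state summarizes the trajectory
        return self.head(h.squeeze(0))  # logits over safety labels

def lagrangian_losses(policy_returns, unsafe_prob, log_lambda, budget=0.05):
    """Lagrangian relaxation: maximize returns subject to P(unsafe) <= budget."""
    lam = log_lambda.exp()              # exponentiation keeps the multiplier non-negative
    violation = unsafe_prob - budget    # positive when the safety constraint is violated
    # The policy minimizes returns-plus-penalty; the multiplier is updated by
    # gradient ascent on the same violation term (hence the detach calls).
    policy_loss = -policy_returns.mean() + (lam.detach() * violation).mean()
    lambda_loss = -(lam * violation.detach()).mean()
    return policy_loss, lambda_loss
```

In such a setup, the policy parameters and the Lagrange multiplier are typically updated in alternation, with the classifier's predicted violation probability flowing into the policy loss so that both models share one computation graph.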