Risk-bounded motion planning is an important yet difficult problem for safety-critical tasks. While existing mathematical programming methods offer theoretical guarantees in the context of constrained Markov decision processes, they either lack scalability to larger problems or produce conservative plans. Recent advances in deep reinforcement learning improve scalability by learning policy networks as function approximators. In this paper, we propose an extension of the soft actor-critic model that estimates the execution risk of a plan through a risk critic and produces risk-bounded policies efficiently by adding an extra risk term to the loss function of the policy network. We define the execution risk in an exact form, rather than approximating it as a sum of immediate risks at each time step, which leads to conservative plans. Our proposed model is conditioned on a continuous spectrum of risk bounds, allowing the user to adjust the agent's level of risk aversion on the fly. Through a set of experiments, we show the advantage of our model in both computational time and plan quality compared to a state-of-the-art mathematical programming baseline, and validate its performance in more complicated scenarios, including nonlinear dynamics and larger state spaces.
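To make the described approach concrete, the sketch below illustrates one plausible form of a risk-augmented soft actor-critic policy loss: a risk critic predicts the execution risk of a sampled action, and a penalty is applied when that prediction exceeds the user-supplied risk bound. The exact architecture, loss weighting, and names used here (`PolicyNet`, `risk_weight`, `delta`) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PolicyNet(nn.Module):
    """Gaussian policy conditioned on the state and an upper risk bound delta.

    Hypothetical sketch: conditioning on delta lets one network cover a
    continuous spectrum of risk bounds, as described in the abstract.
    """

    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim + 1, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.mu = nn.Linear(hidden, action_dim)
        self.log_std = nn.Linear(hidden, action_dim)

    def forward(self, state, delta):
        h = self.body(torch.cat([state, delta], dim=-1))
        std = self.log_std(h).clamp(-5.0, 2.0).exp()
        dist = torch.distributions.Normal(self.mu(h), std)
        a = dist.rsample()                       # reparameterised sample
        log_prob = dist.log_prob(a).sum(-1)
        return torch.tanh(a), log_prob


def policy_loss(policy, q_critic, risk_critic, state, delta, alpha, risk_weight):
    """Standard SAC actor loss plus a penalty on predicted risk-bound violations.

    Assumed form: the extra risk term is a hinge on (execution risk - delta),
    weighted by a hypothetical coefficient risk_weight.
    """
    action, log_prob = policy(state, delta)
    q_value = q_critic(state, action)            # soft Q-value estimate
    exec_risk = risk_critic(state, action)       # predicted execution risk
    risk_violation = F.relu(exec_risk - delta.squeeze(-1))
    return (alpha * log_prob - q_value + risk_weight * risk_violation).mean()
```

Under this assumed formulation, setting a smaller `delta` at test time would make the penalty activate earlier, yielding more risk-averse behavior without retraining; the paper's actual risk term and conditioning mechanism may differ.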