In this paper, we study the problem of learning control policies that are both safe and effective -- i.e., that maximize the probability of satisfying the linear temporal logic (LTL) specification of the task as well as the discounted reward capturing (classic) control performance. We consider unknown environments that can be modeled as Markov decision processes (MDPs). We propose a model-free reinforcement learning algorithm that learns a policy that first maximizes the probability of remaining safe, then the probability of satisfying the given LTL specification, and lastly the expected sum of discounted Quality of Control (QoC) rewards. Finally, we illustrate the applicability of our RL-based approach in a case study.
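To make the prioritized objective concrete, the following is a minimal sketch (not the paper's algorithm) of how a policy could select actions lexicographically across three learned value estimates, preferring safety first, then LTL satisfaction, and only then the QoC return. All function and variable names here are illustrative assumptions.

```python
import numpy as np

def lexicographic_greedy_action(q_safe, q_ltl, q_qoc, state, tol=1e-6):
    """Pick an action by lexicographic preference over three hypothetical
    Q-tables of shape (n_states, n_actions): safety probability first,
    then LTL satisfaction probability, then discounted QoC return."""
    candidates = np.arange(q_safe.shape[1])
    for q in (q_safe, q_ltl, q_qoc):
        values = q[state, candidates]
        best = values.max()
        # Keep only actions that are (near-)optimal for the current objective;
        # remaining ties are broken by the next-priority objective.
        candidates = candidates[values >= best - tol]
        if len(candidates) == 1:
            break
    return int(candidates[0])
```

Under this reading, the `tol` slack decides when two actions are treated as equally good for a higher-priority objective, which is what allows the lower-priority QoC reward to influence the final choice at all.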