This paper addresses the problem of learning control policies for mobile robots, modeled as unknown Markov Decision Processes (MDPs), that are tasked with temporal logic missions, such as sequencing, coverage, or surveillance. The MDP captures uncertainty in the workspace structure and the outcomes of control decisions. The control objective is to synthesize a control policy that maximizes the probability of accomplishing a high-level task, specified as a Linear Temporal Logic (LTL) formula. To address this problem, we propose a novel accelerated model-based reinforcement learning (RL) algorithm for LTL control objectives that is capable of learning control policies significantly faster than related approaches. Its sample efficiency stems from biasing exploration towards directions that may contribute to task satisfaction. This is achieved by leveraging an automaton representation of the LTL task as well as a continuously learned MDP model. Finally, we provide comparative experiments that demonstrate the sample efficiency of the proposed method against recent RL methods for LTL objectives.
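To make the biasing idea concrete, below is a minimal illustrative sketch, not the paper's implementation: exploration over the product of the MDP and the LTL automaton is steered towards actions whose continuously learned transition model assigns high probability to automaton progress (i.e., moving to an automaton state closer to acceptance). All names here (`P_hat`, `dist_to_accept`, `automaton_step`, `epsilon_b`) are hypothetical placeholders, and the automaton-distance heuristic is one simple way to instantiate the bias.

```python
import numpy as np

# Illustrative sketch of automaton-biased exploration over a product MDP.
# Sizes, labeling, and the bias rule are assumptions, not the paper's notation.

rng = np.random.default_rng(0)

nS, nQ, nA = 25, 4, 4                      # MDP states, automaton states, actions
P_hat = np.full((nS, nA, nS), 1.0 / nS)    # continuously learned transition model
counts = np.zeros((nS, nA, nS))            # visitation counts used to update P_hat
dist_to_accept = rng.integers(0, 3, nQ)    # distance of each automaton state to an
                                           # accepting state (from the LTL automaton graph)

def automaton_step(q, s_next):
    # Placeholder for delta(q, L(s_next)); in practice this transition comes from
    # the automaton built from the LTL formula and the state labeling L.
    return (q + 1) % nQ if s_next % 7 == 0 else q

def biased_action(s, q, epsilon_b=0.5):
    """With probability epsilon_b, pick the action the learned model deems most
    likely to reduce the automaton's distance to acceptance; otherwise explore
    uniformly at random."""
    if rng.random() > epsilon_b:
        return int(rng.integers(nA))
    scores = np.zeros(nA)
    for a in range(nA):
        for s_next in range(nS):
            q_next = automaton_step(q, s_next)
            if dist_to_accept[q_next] < dist_to_accept[q]:
                scores[a] += P_hat[s, a, s_next]   # estimated prob. of automaton progress
    return int(np.argmax(scores)) if scores.max() > 0 else int(rng.integers(nA))

def update_model(s, a, s_next):
    # Empirical update of the learned MDP model with Laplace smoothing.
    counts[s, a, s_next] += 1
    P_hat[s, a] = (counts[s, a] + 1) / (counts[s, a].sum() + nS)
```

As the model `P_hat` improves with experience, the biased draws concentrate on actions that plausibly advance the automaton, which is the intuition behind the reported gains in sample efficiency.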