Reinforcement Learning Based Temporal Logic Control with Maximum Probabilistic Satisfaction

This paper presents a model-free reinforcement learning (RL) algorithm that synthesizes a control policy maximizing the probability of satisfying linear temporal logic (LTL) specifications. To account for environment and motion uncertainties, we model the robot motion as a probabilistic labeled Markov decision process with unknown transition probabilities and unknown probabilistic label functions. The LTL task specification is converted to a limit-deterministic generalized Büchi automaton (LDGBA) with several accepting sets, which keeps rewards dense during learning. The novelty lies in constructing an embedded LDGBA (E-LDGBA) by designing a synchronous tracking-frontier function, which records the accepting sets not yet visited without increasing dimensional or computational complexity. With appropriate dependent reward and discount functions, rigorous analysis shows that any method that optimizes the expected discounted return of the RL-based approach is guaranteed to find the optimal policy that maximizes the satisfaction probability of the LTL specifications. A model-free RL-based motion planning strategy is developed to generate this optimal policy. The effectiveness of the RL-based control synthesis is demonstrated through simulation and experimental results.
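To make the tracking-frontier idea concrete, the following is a minimal sketch of how a set of not-yet-visited accepting sets might be updated as the automaton state changes. The abstract does not give the function's definition, so the update rule, the reset behaviour once every set has been visited, and all names here (`update_frontier`, `accepting_sets`, the state labels) are assumptions made for illustration and may differ from the paper's construction.

```python
from typing import FrozenSet, List

State = str
AcceptingSet = FrozenSet[State]


def update_frontier(
    q: State,
    frontier: FrozenSet[int],
    accepting_sets: List[AcceptingSet],
) -> FrozenSet[int]:
    """Return indices of accepting sets still unvisited after reaching state q.

    Hypothetical rule (an assumption, not taken from the paper):
    - remove every accepting set in the frontier that contains q;
    - if nothing remains, start a new round with all accepting sets.
    """
    remaining = frozenset(i for i in frontier if q not in accepting_sets[i])
    if not remaining:
        # Every accepting set has been visited in this round: reset.
        return frozenset(range(len(accepting_sets)))
    return remaining


# Illustrative automaton with two accepting sets (made-up state names).
accepting_sets = [frozenset({"q1"}), frozenset({"q2"})]
frontier = frozenset({0, 1})

frontier_after_q1 = update_frontier("q1", frontier, accepting_sets)
frontier_after_q0 = update_frontier("q0", frontier_after_q1, accepting_sets)
frontier_after_q2 = update_frontier("q2", frontier_after_q0, accepting_sets)
```

In this sketch the frontier is carried alongside the automaton state, so a reward could be issued only when a visit actually shrinks the frontier. Whether the paper ties rewards to the frontier in exactly this way is not stated in the abstract.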
