This paper proposes a Reinforcement Learning (RL) algorithm to synthesize policies for a Markov Decision Process (MDP), such that a linear time property is satisfied. We convert the property into a Limit Deterministic Buchi Automaton (LDBA), then construct a product MDP between the automaton and the original MDP. A reward function is then assigned to the states of the product automaton, according to accepting conditions of the LDBA. With this reward function, our algorithm synthesizes a policy that satisfies the linear time property: as such, the policy synthesis procedure is "constrained" by the given specification. Additionally, we show that the RL procedure sets up an online value iteration method to calculate the maximum probability of satisfying the given property, at any given state of the MDP - a convergence proof for the procedure is provided. Finally, the performance of the algorithm is evaluated via a set of numerical examples. We observe an improvement of one order of magnitude in the number of iterations required for the synthesis compared to existing approaches.
翻译:本文提出“强化学习”算法,以综合Markov 决策程序的政策,从而满足线性时间属性。我们将属性转换成“限制决定因素 Buchi Automaton ” (LDBA),然后在自动映射器和原始MDP之间构建产品 MDP。然后,根据LDBA的接受条件,将奖励功能分配给产品自动映射状态。有了这一奖励功能,我们的算法将满足线性时间属性的政策合成为一种政策:因此,政策合成程序受特定规格的“制约 ” 。此外,我们表明,RL程序设置了在线值迭代法,以计算在任何给定状态满足特定属性的最大概率,即提供该程序的趋同证明。最后,通过一组数字示例评估该算法的性能。我们观察到,与现有方法相比,合成所需的迭代数数量有一定的大小。