This paper proposes a reinforcement learning method for controller synthesis of autonomous systems in unknown and partially observable environments with subjective time-dependent safety constraints. Mathematically, we model the system dynamics as a partially observable Markov decision process (POMDP) with unknown transition and observation probabilities. The time-dependent safety constraint is captured by iLTL, a variant of linear temporal logic for state distributions. Our reinforcement learning method first constructs the belief MDP of the POMDP, which captures the time evolution of the estimated state distributions. Then, by building the product belief MDP of the belief MDP and the limit-deterministic Büchi automaton (LDBA) of the temporal logic constraint, we transform the time-dependent safety constraint on the POMDP into a state-dependent constraint on the product belief MDP. Finally, we learn the optimal policy by value iteration under the state-dependent constraint.
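As a rough illustration of the belief MDP construction mentioned in the abstract, the following minimal sketch shows a standard Bayes-filter belief update for a discrete POMDP. It is not taken from the paper: the function name `belief_update`, the array layouts `T[a, s, s']` and `O[a, s', o]`, and the two-state model are all hypothetical. It also assumes the transition and observation probabilities are available (for example, as current estimates), whereas the paper treats them as unknown.

```python
import numpy as np


def belief_update(belief, action, observation, T, O):
    """One Bayes-filter step on a discrete POMDP (illustrative sketch).

    Assumed layouts (hypothetical, not from the paper):
      T[a, s, s2] = P(s2 | s, a)   transition probabilities
      O[a, s2, o] = P(o | s2, a)   observation probabilities
    """
    # Prediction: predicted[s2] = sum_s belief[s] * T[a, s, s2]
    predicted = belief @ T[action]
    # Correction: weight each next state by the likelihood of the observation
    unnormalized = predicted * O[action][:, observation]
    total = unnormalized.sum()
    if total == 0.0:
        raise ValueError("observation has zero probability under this belief")
    return unnormalized / total


# Hypothetical model: 1 action, 2 states, 2 observations
T = np.array([[[0.9, 0.1],
               [0.2, 0.8]]])
O = np.array([[[0.8, 0.2],
               [0.3, 0.7]]])

b0 = np.array([0.5, 0.5])
b1 = belief_update(b0, action=0, observation=0, T=T, O=O)
```

Each belief vector such as `b1` is a state distribution, so it plays the role of a single state of the belief MDP. Constraints over state distributions, which is what the abstract says iLTL expresses, would then be evaluated on vectors of this kind.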