Partially observable Markov decision processes (POMDPs) are a well-known stochastic model for sequential decision making under limited information. We consider the EXPTIME-hard problem of synthesising policies that almost surely reach some goal state without ever visiting a bad state. In particular, we are interested in computing the winning region, that is, the set of system configurations from which a policy exists that satisfies the reachability specification. A direct application of such a winning region is the safe exploration of POMDPs, for instance, by restricting the behavior of a reinforcement learning agent to the region. We present two algorithms: a novel SAT-based iterative approach and a decision-diagram-based alternative. The empirical evaluation demonstrates the feasibility and efficacy of the approaches.
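To make the safe-exploration use case concrete, the sketch below shows how a precomputed winning region over belief supports could shield a learning agent, permitting only actions whose every possible successor support stays inside the region. The tiny POMDP, the winning region, and all names here are illustrative assumptions, not the paper's implementation; the region itself is assumed to have been produced beforehand, e.g., by the SAT-based or the decision-diagram-based algorithm.

```python
# Hypothetical sketch: shielding an agent with an assumed precomputed
# winning region of belief supports. Not the paper's implementation.
from collections import defaultdict

# Tiny illustrative POMDP: state 3 is the goal, state 2 is bad.
ACTIONS = ["a", "b"]
TRANS = {  # (state, action) -> set of possible successor states
    (0, "a"): {0, 1}, (0, "b"): {2},
    (1, "a"): {3},    (1, "b"): {0, 2},
    (2, "a"): {2},    (2, "b"): {2},
    (3, "a"): {3},    (3, "b"): {3},
}
OBS = {0: "x", 1: "x", 2: "y", 3: "z"}  # state -> observation

def successor_supports(support, action):
    """All belief supports reachable from `support` under `action`,
    one per observation the agent might receive next."""
    by_obs = defaultdict(set)
    for s in support:
        for t in TRANS[(s, action)]:
            by_obs[OBS[t]].add(t)
    return [frozenset(v) for v in by_obs.values()]

# Assumed winning region: supports from which some policy almost surely
# reaches state 3 while never visiting state 2.
WINNING = {frozenset({0, 1}), frozenset({0}), frozenset({1}), frozenset({3})}

def shielded_actions(support):
    """Actions whose every successor support stays in the winning region."""
    return [a for a in ACTIONS
            if all(t in WINNING for t in successor_supports(support, a))]

print(shielded_actions(frozenset({0, 1})))  # ['a'] -- 'b' may reach bad state 2
```

Under these assumptions, an agent exploring from support {0, 1} would be restricted to action 'a', since 'b' can lead to the bad state 2; within the shielded action set, any exploration strategy remains safe.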