We consider the problem: is the optimal expected total reward to reach a goal state in a partially observable Markov decision process (POMDP) below a given threshold? We tackle this generally undecidable problem by computing under-approximations of these expected total rewards. This is done by abstracting finite unfoldings of the infinite belief MDP of the POMDP. The key issue is to find a suitable under-approximation of the value function. We provide two techniques: a simple (cut-off) technique that uses a good policy on the POMDP, and a more advanced technique (belief clipping) that uses minimal shifts of probabilities between beliefs. We use mixed-integer linear programming (MILP) to find such minimal probability shifts and experimentally show that our techniques scale quite well while providing tight lower bounds on the expected total reward.
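To illustrate what a minimal probability shift between beliefs amounts to, the following is a small sketch in notation assumed here (not a formulation quoted from the technical sections): clipping a belief $b$ over the state set $S$ towards a fixed candidate belief $\tilde{b}$ can be posed as a linear program that removes as little probability mass $\delta(s)$ as possible such that the remaining mass is a rescaled copy of $\tilde{b}$:
\[
\delta_{b \to \tilde{b}} \;=\; \min \sum_{s \in S} \delta(s)
\quad\text{s.t.}\quad
0 \le \delta(s) \le b(s)
\;\text{ and }\;
b(s) - \delta(s) \;=\; \Bigl(1 - \sum_{s' \in S} \delta(s')\Bigr)\cdot \tilde{b}(s)
\quad\text{for all } s \in S.
\]
When the candidate $\tilde{b}$ is additionally chosen from a finite set of beliefs (e.g. grid points of a discretized belief space), binary selection variables enter the problem; this is, roughly, where the mixed-integer formulation mentioned above comes in.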