Partially Observable Monte-Carlo Planning (POMCP) is a powerful online algorithm that generates approximate policies for large Partially Observable Markov Decision Processes (POMDPs). Because the method works online, it scales well: it never builds a complete representation of the policy. The lack of an explicit representation, however, hinders policy interpretability and makes policy verification very complex. In this work, we propose two contributions. The first is a method for identifying unexpected actions selected by POMCP with respect to expert prior knowledge of the task. The second is a shielding approach that prevents POMCP from selecting unexpected actions. The first method is based on Satisfiability Modulo Theories (SMT). It inspects traces (i.e., sequences of belief-action-observation triplets) generated by POMCP to compute the parameters of logical formulas about policy properties defined by the expert. The second contribution is a module that uses the logical formulas online to identify anomalous actions selected by POMCP and replaces them with actions that satisfy the formulas and thus agree with expert knowledge. We evaluate our approach on Tiger, a standard benchmark for POMDPs, and on a real-world problem related to velocity regulation in mobile robot navigation. Results show that the shielded POMCP outperforms the standard POMCP in a case study in which an incorrectly set POMCP parameter causes it to occasionally select wrong actions. Moreover, we show that the approach maintains good performance even when the parameters of the logical formulas are optimized using trajectories that contain some wrong actions.
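The shielding idea can be illustrated with a minimal sketch on Tiger. Everything in it is an assumption made for illustration and is not taken from the paper: the action names, the form of the rule (open a door only when the belief that the tiger is behind the other door reaches a threshold), the threshold value 0.9, and the choice of "listen" as the substitute action. In the approach described above, the thresholds would instead be computed from POMCP traces with an SMT solver.

```python
# Minimal shielding sketch for Tiger (illustrative assumptions only).
# belief_left is the agent's belief that the tiger is behind the LEFT door.

def satisfies_rule(belief_left: float, action: str, threshold: float) -> bool:
    """Check a hypothetical expert rule: a door may be opened only if the
    belief that the tiger is behind the OTHER door is at least `threshold`."""
    if action == "open_right":
        return belief_left >= threshold
    if action == "open_left":
        return (1.0 - belief_left) >= threshold
    return True  # "listen" is always allowed by this rule


def shield(belief_left: float, proposed: str, threshold: float = 0.9) -> str:
    """Return the planner's proposed action if it satisfies the rule;
    otherwise substitute "listen", which always satisfies it."""
    if satisfies_rule(belief_left, proposed, threshold):
        return proposed
    return "listen"
```

For example, with a uniform belief (`belief_left = 0.5`) a proposed `"open_left"` would be replaced by `"listen"`, while with `belief_left = 0.95` a proposed `"open_right"` would pass through unchanged.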