Partially Observable Monte Carlo Planning (POMCP) is an efficient solver for Partially Observable Markov Decision Processes (POMDPs). It scales to large state spaces by computing a local, online approximation of the optimal policy using a strategy based on Monte Carlo Tree Search. However, POMCP suffers from sparse reward functions, i.e., rewards achieved only when the final goal is reached, particularly in environments with large state spaces and long horizons. Recently, logic specifications have been integrated into POMCP to guide exploration and to satisfy safety requirements. However, such policy-related rules must be manually defined by domain experts, which is especially demanding in real-world scenarios. In this paper, we use inductive logic programming to learn logic specifications from traces of POMCP executions, i.e., sets of belief-action pairs generated by the planner. Specifically, we learn rules expressed in the paradigm of answer set programming. We then integrate them inside POMCP to provide a soft policy bias toward promising actions. On two benchmark scenarios, rocksample and battery, we show that integrating rules learned from small task instances improves performance even with fewer Monte Carlo simulations and on larger task instances. We make our modified version of POMCP publicly available at https://github.com/GiuMaz/pomcp_clingo.git.
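For illustration only, and not the authors' exact formulation, the following Python sketch shows one way a "soft policy bias" of this kind could be injected into UCB1 action selection inside the search tree: actions for which a learned rule fires in the current belief receive a bounded bonus, so the rules prefer but never forbid actions. All names and parameters here (`ucb1_with_soft_bias`, `rule_bias`, the weight `w`) are illustrative assumptions, not part of the paper or the released code.

```python
import math
import random

def ucb1_with_soft_bias(node, actions, c=1.0, w=0.5, rule_bias=None):
    """Select an action at a search-tree node using UCB1 plus an optional
    soft bias toward actions suggested by learned logic rules.

    node:      dict with per-action visit counts node["N"][a] and value
               estimates node["Q"][a]
    rule_bias: dict mapping actions to a bias in [0, 1], e.g. 1.0 if a
               learned rule fires for the current belief, 0.0 otherwise
    w:         weight of the soft bias (hypothetical tuning parameter)
    """
    total_visits = sum(node["N"][a] for a in actions) or 1
    rule_bias = rule_bias or {}

    def score(a):
        n = node["N"][a]
        if n == 0:
            return float("inf")           # expand unvisited actions first
        exploration = c * math.sqrt(math.log(total_visits) / n)
        bias = w * rule_bias.get(a, 0.0)  # soft guidance: a bounded bonus only
        return node["Q"][a] + exploration + bias

    best = max(score(a) for a in actions)
    return random.choice([a for a in actions if score(a) == best])
```

Because the bias is a bounded additive term rather than a hard constraint, exploration can still override it when the value estimates disagree with the learned rules.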