In this article, we discuss how to solve information-gathering problems expressed as rho-POMDPs, an extension of Partially Observable Markov Decision Processes (POMDPs) whose reward rho depends on the belief state. Point-based approaches used for solving POMDPs have been extended to solving rho-POMDPs as belief MDPs when the reward rho is convex over the belief space B, or when it is Lipschitz-continuous. In the present paper, we build on the POMCP algorithm to propose a Monte Carlo Tree Search for rho-POMDPs, aiming for an efficient on-line planner that can be used for any rho function. Because rewards are belief-dependent, adaptations are required to (i) propagate more than one state at a time and (ii) prevent biases in value estimates. An asymptotic convergence proof to epsilon-optimal values is given when rho is continuous. Experiments are conducted to analyze the algorithms at hand and show that they outperform myopic approaches.
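To make point (i) concrete, the sketch below (not the paper's algorithm) shows why a belief-dependent reward such as negative entropy forces a POMCP-style simulator to carry a whole set of particles per node: rho(b) simply cannot be evaluated from a single sampled state. The generative model `step`, the rejection-based `propagate_belief` update, and all other names are hypothetical placeholders introduced only for illustration.

```python
# Minimal sketch (assumption, not the authors' implementation): with a
# belief-dependent reward, each simulation step must advance a particle set
# so that rho can be estimated at every node of the search.

import math
import random
from collections import Counter

def rho_neg_entropy(particles):
    """Belief-dependent reward rho(b) = -H(b), estimated from particles."""
    counts = Counter(particles)
    n = len(particles)
    return sum((c / n) * math.log(c / n) for c in counts.values())

def propagate_belief(particles, action, observation, step):
    """Advance every particle and keep those consistent with the observation
    (simple rejection filtering); a single state could not represent b."""
    survivors = []
    for s in particles:
        s_next, o = step(s, action)          # step: hypothetical generative model
        if o == observation:
            survivors.append(s_next)
    return survivors or particles            # avoid an empty belief in this toy sketch

# Toy usage: a 2-state hidden system observed through a noisy sensor.
def step(state, action):
    s_next = state if action == "stay" else 1 - state
    obs = s_next if random.random() < 0.8 else 1 - s_next
    return s_next, obs

belief = [random.randint(0, 1) for _ in range(500)]   # uniform initial belief
belief = propagate_belief(belief, "stay", 1, step)
print("estimated rho(b) =", rho_neg_entropy(belief))
```

Under these assumptions, the per-node particle set plays the role of the belief estimate from which rho is computed, which is what distinguishes this setting from standard POMCP, where a single sampled state per simulation suffices to accumulate state-dependent rewards.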