The Partially Observable Markov Decision Process (POMDP) is a framework applicable to many real-world problems. In this work, we propose an approach to solving POMDPs with multimodal beliefs by relying on a policy that solves the fully observable version. By defining a new mixture value function based on the value function of the fully observable variant, we can use the corresponding greedy policy to solve the POMDP itself. We develop the mathematical framework necessary for the discussion, and introduce a benchmark built on the task of Reconnaissance Blind TicTacToe. On this benchmark, we show that our policy outperforms policies that ignore the existence of multiple modes.
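As a rough illustration of the idea (a sketch only, not the paper's exact construction): given the optimal value function \(V^*\) and action-value function \(Q^*\) of the fully observable MDP, one natural QMDP-style mixture weighs them by the belief \(b\) over states,

\[
V_{\mathrm{mix}}(b) \;=\; \sum_{s \in \mathcal{S}} b(s)\, V^*(s),
\qquad
\pi(b) \;=\; \arg\max_{a \in \mathcal{A}} \sum_{s \in \mathcal{S}} b(s)\, Q^*(s, a),
\]

where the symbols \(\mathcal{S}\), \(\mathcal{A}\), and the weighting scheme are assumptions for illustration; the mixture value function defined in the paper may differ in how the modes of a multimodal belief are treated.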