Partially Observable Markov Decision Processes (POMDPs) are useful tools for modeling environments in which the agent cannot perceive the full state. As such, the agent must reason over its past observations and actions. However, simply remembering the full history is generally intractable because the history space grows exponentially. A probability distribution modeling the belief over the true state can serve as a sufficient statistic of the history, but computing it requires access to the environment model and is itself intractable. Current state-of-the-art algorithms use Recurrent Neural Networks (RNNs) to compress the observation-action history in the hope of learning a sufficient statistic, but they lack guarantees of success and can lead to suboptimal policies. To overcome this, we propose the Wasserstein-Belief-Updater (WBU), an RL algorithm that learns a latent model of the POMDP and an approximation of the belief update. Our approach comes with theoretical guarantees on the quality of this approximation, ensuring that the beliefs it outputs allow for learning the optimal value function.
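For reference, the exact belief update that such an approach approximates is the standard Bayes filter; the notation below (transition kernel T, observation function O over state space S) is the usual POMDP formulation and is assumed here rather than taken from the paper. After taking action a in belief b and observing o, the updated belief over next states s' is

\[
  b'(s') \;=\; \frac{O(o \mid s', a) \sum_{s \in S} T(s' \mid s, a)\, b(s)}
                    {\sum_{s'' \in S} O(o \mid s'', a) \sum_{s \in S} T(s'' \mid s, a)\, b(s)} .
\]

Evaluating this update exactly requires knowledge of T and O and a sum over the state space at every step, which is what makes maintaining the exact belief intractable in general.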