POMDPs capture a broad class of decision-making problems, but hardness results suggest that learning is intractable even in simple settings due to the inherent partial observability. However, in many realistic problems, more information is either revealed or can be computed at some point during the learning process. Motivated by diverse applications ranging from robotics to data center scheduling, we formulate a Hindsight Observable Markov Decision Process (HOMDP) as a POMDP where the latent states are revealed to the learner in hindsight, and only during training. We introduce new algorithms for the tabular and function approximation settings that are provably sample-efficient with hindsight observability, even in POMDPs that would otherwise be statistically intractable. We give a lower bound showing that the tabular algorithm is optimal in its dependence on the latent state and observation cardinalities.
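To make the interaction protocol concrete, below is a minimal sketch of what hindsight observability means operationally: the learner acts on noisy observations alone, and the latent-state trajectory is revealed only after each episode. All names here (the toy environment, its dynamics, and the learner interface) are hypothetical illustrations, not the paper's algorithms.

```python
# A minimal sketch of a hindsight-observable interaction loop (assumption:
# this toy environment and all names in it are illustrative, not from the paper).
import random

class HindsightObservablePOMDP:
    """Toy POMDP: the agent sees only noisy observations while acting,
    but the full latent-state trajectory is revealed after the episode."""

    def __init__(self, num_states=3, num_obs=5, num_actions=2, horizon=10):
        self.num_states, self.num_obs = num_states, num_obs
        self.num_actions, self.horizon = num_actions, horizon

    def reset(self):
        self._states = [random.randrange(self.num_states)]
        return self._emit()

    def _emit(self):
        # Noisy observation of the current latent state.
        return (self._states[-1] + random.randrange(2)) % self.num_obs

    def step(self, action):
        s = (self._states[-1] + action) % self.num_states
        self._states.append(s)
        reward = float(s == 0)
        return self._emit(), reward

    def hindsight_states(self):
        # Hindsight observability: latent states are revealed only after
        # the episode ends, and only during training.
        return list(self._states)

def run_episode(env, policy, learner_update):
    obs, traj = env.reset(), []
    for _ in range(env.horizon):
        action = policy(obs)
        next_obs, reward = env.step(action)
        traj.append((obs, action, reward))
        obs = next_obs
    # After the episode, the learner additionally receives the latent states,
    # which is what enables sample-efficient learning in this setting.
    learner_update(traj, env.hindsight_states())
```

The key design point the sketch highlights is that `hindsight_states` is available only to the training update, never to the acting policy, matching the abstract's requirement that latent states are revealed in hindsight and only during training.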