POMDPs capture a broad class of decision-making problems, but hardness results suggest that learning is intractable even in simple settings due to the inherent partial observability. However, in many realistic problems, more information is either revealed or can be computed at some point during the learning process. Motivated by diverse applications ranging from robotics to data center scheduling, we formulate a \setting (\setshort) as a POMDP where the latent states are revealed to the learner in hindsight and only during training. We introduce new algorithms for the tabular and function approximation settings that are provably sample-efficient with hindsight observability, even in POMDPs that would otherwise be statistically intractable. We give a lower bound showing that the tabular algorithm is optimal in its dependence on latent state and observation cardinalities.