We study reinforcement learning with function approximation for large-scale Partially Observable Markov Decision Processes (POMDPs) whose state and observation spaces are large or even continuous. In particular, we consider Hilbert space embeddings of POMDPs, in which the features of the latent states and of the observations admit a conditional Hilbert space embedding of the observation emission process, and the latent state transition is deterministic. Under the function approximation setup where the optimal latent state-action $Q$-function is linear in the state feature and the optimal $Q$-function has a gap in actions, we provide a \emph{computationally and statistically efficient} algorithm for finding the \emph{exact optimal} policy. We show that our algorithm's computational and statistical complexities scale polynomially in the horizon and the intrinsic dimension of the feature on the observation space. Furthermore, we show that both the deterministic latent transition and the gap assumptions are necessary to avoid statistical complexity that is exponential in the horizon or the dimension. Since our guarantee has no explicit dependence on the size of the state and observation spaces, our algorithm provably scales to large-scale POMDPs.
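For concreteness, a minimal sketch of the assumed structure follows; the operator $K$, the feature maps $\phi$ and $\psi$, the weights $w_{h,a}$, and the gap $\Delta$ are illustrative names rather than the paper's exact notation.
% Illustrative formalization of the three structural assumptions above,
% for latent state s, action a, observation o, and step h.
%
% (1) Conditional Hilbert space embedding of the emission process:
%     the conditional mean of the observation feature is linear in the state feature.
\[
  \mathbb{E}\bigl[\psi(o) \,\big|\, s\bigr] \;=\; K\,\phi(s),
  \qquad \psi(o) \in \mathbb{R}^{d_O},\quad \phi(s) \in \mathbb{R}^{d_S}.
\]
% (2) Linear optimal latent state-action Q-function at every step h.
\[
  Q^{\star}_{h}(s,a) \;=\; \bigl\langle w_{h,a},\, \phi(s) \bigr\rangle .
\]
% (3) Suboptimality gap: the optimal action beats every other action by at least Delta.
\[
  Q^{\star}_{h}\bigl(s, \pi^{\star}_{h}(s)\bigr) \;-\; \max_{a \neq \pi^{\star}_{h}(s)} Q^{\star}_{h}(s,a) \;\geq\; \Delta \;>\; 0 .
\]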