Despite the success of reinforcement learning (RL) for Markov decision processes (MDPs) with function approximation, most RL algorithms can easily fail when the agent has only partial observations of the state. Such a setting is often modeled as a partially observable Markov decision process (POMDP). Existing sample-efficient algorithms for POMDPs are restricted to the tabular setting, where the state and observation spaces are finite. In this paper, we make the first attempt at tackling the tension between function approximation and partial observability. Specifically, we focus on a class of undercomplete POMDPs with linear function approximation, which allows the state and observation spaces to be infinite. For such POMDPs, we show that the optimal policy and value function can be characterized by a sequence of finite-memory Bellman operators. We propose an RL algorithm that constructs optimistic estimators of these operators via reproducing kernel Hilbert space (RKHS) embedding. Moreover, we theoretically prove that the proposed algorithm finds an $\varepsilon$-optimal policy with $\tilde O(1/\varepsilon^2)$ episodes of exploration. This sample complexity depends only polynomially on the intrinsic dimension of the POMDP and is independent of the sizes of the state and observation spaces. To the best of our knowledge, this is the first provably sample-efficient algorithm for POMDPs with function approximation.
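As a minimal illustration of the RKHS embedding step (standard kernel-methods notation; the symbols $K$, $k_x$, $\lambda$, and $\beta$ are illustrative and not taken from the paper's construction): given a kernel $k$ with feature map $\phi$ and samples $\{(x_i, y_i)\}_{i=1}^n$, the conditional mean embedding $\mu(x) = \mathbb{E}[\phi(Y) \mid X = x]$ admits the kernel ridge-regression estimate
\[
\hat{\mu}(x) = \sum_{i=1}^n \beta_i(x)\, \phi(y_i), \qquad \beta(x) = (K + \lambda n I)^{-1} k_x,
\]
where $K_{ij} = k(x_i, x_j)$, $(k_x)_i = k(x_i, x)$, and $\lambda > 0$ is a regularization parameter. A UCB-style bonus of the form $b(x) \propto \sqrt{k(x,x) - k_x^\top (K + \lambda n I)^{-1} k_x}$ is one standard way to make such kernel estimators optimistic; in the paper, the analogous operators are the finite-memory Bellman operators rather than this generic conditional embedding.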