We study reinforcement learning for partially observed Markov decision processes (POMDPs) with infinite observation and state spaces, a setting that remains less investigated theoretically. To this end, we make the first attempt at bridging partial observability and function approximation for a class of POMDPs with a linear structure. Specifically, we propose a reinforcement learning algorithm (Optimistic Exploration via Adversarial Integral Equation, or OP-TENET) that attains an $\epsilon$-optimal policy within $O(1/\epsilon^2)$ episodes. In particular, the sample complexity scales polynomially in the intrinsic dimension of the linear structure and is independent of the size of the observation and state spaces. The sample efficiency of OP-TENET is enabled by a sequence of ingredients: (i) a Bellman operator with finite memory, which represents the value function in a recursive manner, (ii) the identification and estimation of such an operator via an adversarial integral equation, which features a smoothed discriminator tailored to the linear structure, and (iii) the exploration of the observation and state spaces via optimism, which is based on quantifying the uncertainty in the adversarial integral equation.
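To make ingredients (ii) and (iii) more concrete, the following is a minimal sketch, under illustrative notation that does not match the paper's exact formulation, of how an operator identified by an integral (conditional-moment) equation $\mathbb{E}[\rho(Z; \mathbb{B}) \mid X] = 0$ can be estimated adversarially with a smoothed discriminator class $\mathcal{F}$:
\[
\widehat{\mathbb{B}} \in \arg\min_{\mathbb{B} \in \mathcal{B}} \; \max_{f \in \mathcal{F}} \; \widehat{\mathbb{E}}\bigl[f(X)\,\rho(Z; \mathbb{B})\bigr] - \frac{\lambda}{2}\,\widehat{\mathbb{E}}\bigl[f(X)^2\bigr],
\]
where the quadratic penalty with weight $\lambda$ regularizes (smooths) the discriminator $f$. Optimism can then be implemented, again schematically, by restricting attention to the confidence set of operators whose adversarial residual falls below a threshold $\beta$,
\[
\widehat{\mathcal{C}} = \Bigl\{\mathbb{B} \in \mathcal{B} : \max_{f \in \mathcal{F}} \widehat{\mathbb{E}}\bigl[f(X)\,\rho(Z; \mathbb{B})\bigr] - \frac{\lambda}{2}\,\widehat{\mathbb{E}}\bigl[f(X)^2\bigr] \le \beta\Bigr\},
\]
and executing the policy that is optimal under the most favorable operator in $\widehat{\mathcal{C}}$. Here $\rho$, $X$, $Z$, $\mathcal{B}$, $\mathcal{F}$, $\lambda$, and $\beta$ are assumed placeholders for this sketch rather than the paper's notation.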