This paper studies the fundamental limits of reinforcement learning (RL) in the challenging \emph{partially observable} setting. While it is well established that learning in Partially Observable Markov Decision Processes (POMDPs) requires exponentially many samples in the worst case, a surge of recent work shows that polynomial sample complexities are achievable under the \emph{revealing condition}, a natural condition that requires the observables to reveal some information about the unobserved latent states. However, the fundamental limits for learning in revealing POMDPs are much less understood: existing lower bounds are rather preliminary and leave substantial gaps to the current best upper bounds. We establish strong PAC and regret lower bounds for learning in revealing POMDPs. Our lower bounds scale polynomially in all relevant problem parameters in a multiplicative fashion and leave significantly smaller gaps to the current best upper bounds, providing a solid starting point for future studies. In particular, for \emph{multi-step} revealing POMDPs, we show that (1) the latent state-space dependence is at least $\Omega(S^{1.5})$ in the PAC sample complexity, which is notably harder than the $\widetilde{\Theta}(S)$ scaling for fully observable MDPs; and (2) any polynomial sublinear regret is at least $\Omega(T^{2/3})$, suggesting a fundamental difference from the \emph{single-step} case, where $\widetilde{O}(\sqrt{T})$ regret is achievable. Technically, our hard instance construction adapts techniques from \emph{distribution testing}, which are new to the RL literature and may be of independent interest.