Real-world sequential decision-making problems commonly involve partial observability, which requires the agent to maintain a memory of the history in order to infer the latent states, plan, and make good decisions. Coping with partial observability in general is extremely challenging, as a number of worst-case statistical and computational barriers are known for learning Partially Observable Markov Decision Processes (POMDPs). Motivated by the problem structure in several physical applications, as well as a commonly used technique known as "frame stacking", this paper proposes to study a new subclass of POMDPs whose latent states can be decoded from the most recent history of a short length $m$. We establish a set of upper and lower bounds on the sample complexity for learning near-optimal policies for this class of problems in both tabular and rich-observation settings (where the number of observations is enormous). In particular, in the rich-observation setting, we develop new algorithms using a novel "moment matching" approach with a sample complexity that scales exponentially with the short length $m$ rather than the problem horizon, and is independent of the number of observations. Our results show that a short-term memory suffices for reinforcement learning in these environments.
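To make the "frame stacking" idea concrete, the sketch below wraps a partially observable environment so that the agent's effective state is the concatenation of its most recent $m$ observations; under the $m$-step decodability assumption studied here, this stacked window determines the latent state, so an agent acting on stacked frames can behave as if the problem were fully observable. All names (`ShortMemoryWrapper`, the `reset`/`step` interface of `env`) are illustrative assumptions, not part of the paper's algorithms.

```python
# Minimal sketch of frame stacking: the agent observes the last m frames,
# zero-padded at the start of an episode. Interface names are hypothetical.
from collections import deque

import numpy as np


class ShortMemoryWrapper:
    """Exposes the concatenation of the last m observations as the state."""

    def __init__(self, env, m):
        self.env = env          # assumed: env.reset() -> obs, env.step(a) -> (obs, reward, done)
        self.m = m              # length of the short-term memory window
        self.frames = deque(maxlen=m)

    def _stacked(self):
        # Pad with zero observations until m frames have been collected.
        pad = [np.zeros_like(self.frames[0])] * (self.m - len(self.frames))
        return np.concatenate(pad + list(self.frames), axis=0)

    def reset(self):
        obs = self.env.reset()
        self.frames.clear()
        self.frames.append(obs)
        return self._stacked()

    def step(self, action):
        obs, reward, done = self.env.step(action)
        self.frames.append(obs)
        return self._stacked(), reward, done
```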