Partially Observable RL with B-Stability: Unified Structural Condition and Sharp Sample-Efficient Algorithms

Partial observability, where agents can only observe partial information about the true underlying state of the system, is ubiquitous in real-world applications of Reinforcement Learning (RL). Theoretically, learning a near-optimal policy under partial observability is known to be hard in the worst case due to an exponential sample complexity lower bound. Recent work has identified several tractable subclasses that are learnable with polynomial samples, such as Partially Observable Markov Decision Processes (POMDPs) with certain revealing or decodability conditions. However, this line of research is still in its infancy: (1) unified structural conditions enabling sample-efficient learning are lacking; (2) existing sample complexities for known tractable subclasses are far from sharp; and (3) fewer sample-efficient algorithms are available than in fully observable RL. This paper advances all three aspects for partially observable RL in the general setting of Predictive State Representations (PSRs). First, we propose a natural and unified structural condition for PSRs called B-stability. B-stable PSRs encompass the vast majority of known tractable subclasses, such as weakly revealing POMDPs, low-rank future-sufficient POMDPs, decodable POMDPs, and regular PSRs. Next, we show that any B-stable PSR can be learned with a number of samples polynomial in the relevant problem parameters. When instantiated in the aforementioned subclasses, our sample complexities improve substantially over the current best ones. Finally, our results are achieved by three algorithms simultaneously: Optimistic Maximum Likelihood Estimation, Estimation-to-Decisions, and Model-Based Optimistic Posterior Sampling. The latter two algorithms are new for sample-efficient learning of POMDPs/PSRs.
