Finding the minimal structural assumptions that empower sample-efficient learning is one of the most important research directions in Reinforcement Learning (RL). This paper advances our understanding of this fundamental question by introducing a new complexity measure -- Bellman Eluder (BE) dimension. We show that the family of RL problems of low BE dimension is remarkably rich, which subsumes a vast majority of existing tractable RL problems including but not limited to tabular MDPs, linear MDPs, reactive POMDPs, low Bellman rank problems as well as low Eluder dimension problems. This paper further designs a new optimization-based algorithm -- GOLF, and reanalyzes a hypothesis elimination-based algorithm -- OLIVE (proposed in Jiang et al., 2017). We prove that both algorithms learn the near-optimal policies of low BE dimension problems in a number of samples that is polynomial in all relevant parameters, but independent of the size of state-action space. Our regret and sample complexity results match or improve the best existing results for several well-known subclasses of low BE dimension problems.
翻译:找到增强抽样效率学习的最低限度结构假设,是加强学习(RL)最重要的研究方向之一。本文件通过引入新的复杂度 -- -- Bellman Eluder(BE) 维度 -- -- 增进了我们对这个根本问题的了解。我们表明,低BE维度RL问题的家庭非常丰富,它囊括了绝大多数现有的可移植RL问题,包括但不限于表格式MDP、线性MDP、反应式POMDP、低贝尔曼等级问题以及低 Eluder维度问题。本文还设计了一种新的基于优化的算法 -- -- GOLF, 并重新分析基于假设的消除算法 -- -- OLIVE(在江等人(2017年)提出)。我们证明,这两种算法都学习了在所有相关参数上都是多元的样本中的低BE维度问题近乎最佳的政策,但与国家行动空间的大小无关。我们的遗憾和抽样复杂程度使得一些众所周知的低BE维度问题的子类现有最佳结果相匹配或改进。