This paper introduces a simple, efficient learning algorithm for general sequential decision making. The algorithm combines Optimism for exploration with Maximum Likelihood Estimation for model estimation, and is therefore named OMLE. We prove that OMLE learns near-optimal policies for an enormously rich class of sequential decision making problems using a polynomial number of samples. This rich class includes not only a majority of known tractable model-based Reinforcement Learning (RL) problems (such as tabular MDPs, factored MDPs, low witness rank problems, tabular weakly-revealing/observable POMDPs and multi-step decodable POMDPs), but also many new challenging RL problems, especially in the partially observable setting, that were not previously known to be tractable. Notably, the new problems addressed by this paper include (1) observable POMDPs with continuous observations and function approximation, where we achieve the first sample complexity that is completely independent of the size of the observation space; (2) well-conditioned low-rank sequential decision making problems (also known as Predictive State Representations (PSRs)), which include and generalize all known tractable POMDP examples under a more intrinsic representation; (3) general sequential decision making problems under the SAIL condition, which unifies our existing understanding of model-based RL in both fully observable and partially observable settings. The SAIL condition, identified in this paper, can be viewed as a natural generalization of Bellman/witness rank to address partial observability. This paper also presents a reward-free variant of the OMLE algorithm, which learns approximate dynamics models that enable the computation of near-optimal policies for all reward functions simultaneously.
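To make the combination of optimism and maximum likelihood estimation concrete, the following is a minimal sketch on a toy problem, not the paper's algorithm as stated. It assumes a two-armed Bernoulli bandit with a small finite model class, whereas the paper treats general sequential decision making problems over trajectories. The names `log_likelihood`, `confidence_set`, and `omle_bandit`, the hand-picked slack `beta`, and the example models are all hypothetical choices made for illustration.

```python
import math
import random


def log_likelihood(model, data):
    """Log-likelihood of the observed (arm, reward) pairs under one candidate model.

    A model is a tuple of success probabilities, one per arm, each assumed
    to lie strictly between 0 and 1 so that the logarithm is finite.
    """
    total = 0.0
    for arm, reward in data:
        p = model[arm]
        total += math.log(p if reward == 1 else 1.0 - p)
    return total


def confidence_set(models, data, beta):
    """Keep every model whose log-likelihood is within beta of the maximum."""
    scores = [log_likelihood(m, data) for m in models]
    best = max(scores)
    return [m for m, s in zip(models, scores) if s >= best - beta]


def omle_bandit(models, true_means, num_rounds, beta, seed=0):
    """Toy optimism-plus-MLE loop (illustrative assumption, see lead-in)."""
    rng = random.Random(seed)
    data = []
    for _ in range(num_rounds):
        # MLE step: models that explain the data nearly as well as the best one.
        plausible = confidence_set(models, data, beta)
        # Optimism step: pick the plausible model with the highest best-arm value,
        # then play the arm that this model considers best.
        optimistic_model = max(plausible, key=max)
        arm = max(range(len(optimistic_model)), key=lambda a: optimistic_model[a])
        # Interact with the (unknown) environment and record the outcome.
        reward = 1 if rng.random() < true_means[arm] else 0
        data.append((arm, reward))
    return confidence_set(models, data, beta), data


if __name__ == "__main__":
    candidate_models = [(0.2, 0.8), (0.8, 0.2), (0.5, 0.5)]
    remaining, history = omle_bandit(
        candidate_models, true_means=(0.8, 0.2), num_rounds=50, beta=3.0
    )
    print("models still in the confidence set:", remaining)
    print("arms played in the last five rounds:", [arm for arm, _ in history[-5:]])
```

The slack `beta` controls how quickly models are ruled out: a larger value keeps more candidates in the confidence set and therefore explores longer, while a smaller value commits sooner at the risk of discarding the true model after a few unlucky samples.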