Although model-based reinforcement learning (RL) approaches are considered more sample-efficient, existing algorithms usually rely on sophisticated planning algorithms that are tightly coupled with the model-learning procedure. As a result, the learned models may not be reusable with more specialized planners. In this paper we address this issue and provide an approach for learning an RL model efficiently without the guidance of a reward signal. In particular, we take a plug-in solver approach, where we focus on learning a model in the exploration phase and demand that \emph{any planning algorithm} run on the learned model yields a near-optimal policy. Specifically, we focus on the linear mixture MDP setting, where the probability transition matrix is an (unknown) convex combination of a set of known base models. We show that, by establishing a novel exploration algorithm, the plug-in approach learns a model using $\tilde{O}(d^2H^3/\epsilon^2)$ interactions with the environment, and \emph{any} $\epsilon$-optimal planner on the learned model gives an $O(\epsilon)$-optimal policy on the original model. This sample complexity matches lower bounds for non-plug-in approaches and is \emph{statistically optimal}. We achieve this result by leveraging a careful maximum total-variance bound via a Bernstein-type inequality together with properties specific to linear mixture MDPs.
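For concreteness, the linear mixture structure referred to above can be written as follows; this is a sketch in our own illustrative notation ($P_1,\dots,P_d$ for the known basis kernels and $\theta^*$ for the unknown mixing weights), which may differ from the symbols used in the body of the paper:
% Sketch of the linear mixture MDP transition structure (illustrative notation).
\begin{equation*}
P_{\theta^*}(s' \mid s, a) \;=\; \sum_{i=1}^{d} \theta^*_i \, P_i(s' \mid s, a),
\qquad
\theta^* \in \Delta^{d-1} = \Bigl\{\theta \in \mathbb{R}^d_{\ge 0} : \textstyle\sum_{i=1}^d \theta_i = 1\Bigr\},
\end{equation*}
where each basis kernel $P_i$ is known in advance and only the mixing weights $\theta^*$ need to be estimated; here $d$ is the number of basis models appearing in the $\tilde{O}(d^2H^3/\epsilon^2)$ bound and $H$ is the planning horizon.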