In reward-free reinforcement learning (RL), an agent first explores the environment without any reward information, and then, for any given reward, aims to achieve certain learning goals. In this paper, we focus on reward-free RL under low-rank MDP models, in which both the representation and the linear weight vectors are unknown. Although various algorithms have been proposed for reward-free low-rank MDPs, the corresponding sample complexity is still far from satisfactory. In this work, we establish the first known sample complexity lower bound that holds for any algorithm under low-rank MDPs. This lower bound implies that it is strictly harder to find a near-optimal policy under low-rank MDPs than under linear MDPs. We then propose a novel model-based algorithm, coined RAFFLE, and show that it can both find an $\epsilon$-optimal policy and achieve $\epsilon$-accurate system identification via reward-free exploration, with a sample complexity that significantly improves upon previous results. This sample complexity matches our lower bound in its dependence on $\epsilon$, as well as on $K$ in the large-$d$ regime, where $d$ and $K$ respectively denote the representation dimension and the action space cardinality. Finally, we provide a planning algorithm (requiring no further interaction with the true environment) for RAFFLE to learn a near-accurate representation, which is the first known representation learning guarantee under this setting.