While representation learning has become a powerful technique for reducing sample complexity in reinforcement learning (RL) in practice, theoretical understanding of its advantage is still limited. In this paper, we theoretically characterize the benefit of representation learning under the low-rank Markov decision process (MDP) model. We first study multitask low-rank RL (as upstream training), where all tasks share a common representation, and propose a new multitask reward-free algorithm called REFUEL. REFUEL learns both the transition kernel and a near-optimal policy for each task, and outputs a well-learned representation for downstream tasks. Our result demonstrates that multitask representation learning is provably more sample-efficient than learning each task individually, as long as the total number of tasks is above a certain threshold. We then study downstream RL in both the online and offline settings, where the agent is assigned a new task that shares the same representation as the upstream tasks. For both settings, we develop a sample-efficient algorithm and show that it finds a near-optimal policy whose suboptimality gap is bounded by the sum of the estimation error of the representation learned upstream and a term that vanishes as the number of downstream samples grows. Our downstream results for online and offline RL further capture the benefit of employing the representation learned upstream, as opposed to learning the representation of the low-rank model directly. To the best of our knowledge, this is the first theoretical study that characterizes the benefit of representation learning in exploration-based reward-free multitask RL for both upstream and downstream tasks.
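For context, the following is a minimal sketch of the shared-representation structure referred to above; the symbols $\phi^\star$, $\mu_h^{(t)}$, the dimension $d$, and the task index $t$ are illustrative conventions for low-rank MDPs and may differ from the paper's exact notation. In a low-rank MDP, each task's transition kernel factorizes through a common feature map shared across all upstream tasks, while the task-specific component varies.

% Low-rank MDP with a representation shared across T upstream tasks.
% \phi^\star : \mathcal{S} \times \mathcal{A} \to \mathbb{R}^d is the common representation;
% \mu_h^{(t)} : \mathcal{S} \to \mathbb{R}^d is specific to task t and step h.
\begin{equation*}
  P_h^{(t)}(s' \mid s, a) \;=\; \big\langle \phi^\star(s,a),\, \mu_h^{(t)}(s') \big\rangle,
  \qquad t = 1, \dots, T, \quad h = 1, \dots, H.
\end{equation*}

Under this (assumed) factorization, upstream multitask training estimates $\phi^\star$ once, and each downstream task only needs to fit its own low-dimensional $\mu_h$ component, which is the source of the sample-complexity benefit described in the abstract.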