While parallelism has been extensively used in reinforcement learning (RL), the quantitative effects of parallel exploration are not well understood theoretically. We study the benefits of simple parallel exploration for reward-free RL in linear Markov decision processes (MDPs) and two-player zero-sum Markov games (MGs). In contrast to the existing literature, which focuses on approaches that encourage agents to explore over a diverse set of policies, we show that using a single policy to guide exploration across all agents is sufficient to obtain an almost-linear speedup in all cases compared to their fully sequential counterparts. Further, we show that this simple procedure is minimax optimal up to logarithmic factors in the reward-free setting for both linear MDPs and two-player zero-sum MGs. From a practical perspective, our paper shows that a single policy is sufficient and provably optimal for incorporating parallelism during the exploration phase.
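To make the single-policy idea concrete, the toy sketch below (not the paper's algorithm) assumes a hypothetical linear-MDP feature map `phi`, a toy transition kernel `step`, feature dimension `d`, horizon `H`, and `P` parallel agents. In each round every agent follows the same exploration policy, here an assumed elliptical-bonus rule built from the shared covariance matrix, and all agents' data is aggregated before the next round; this pooling of identically collected data is the mechanism behind the near-linear speedup discussed above.

```python
import numpy as np

# Illustrative sketch only: reward-free parallel exploration in a linear MDP
# where all P parallel agents follow the SAME exploration policy each round.
# phi, step, d, H, P, and the bonus rule are stand-in assumptions.

d, H, P = 8, 5, 16          # feature dimension, horizon, number of parallel agents
num_rounds = 50
rng = np.random.default_rng(0)

def phi(state, action):
    """Stand-in random feature map for a linear MDP."""
    local = np.random.default_rng(hash((state, action)) % (2**32))
    return local.normal(size=d) / np.sqrt(d)

def step(state, action):
    """Toy deterministic transition kernel over 20 states."""
    return (state * 3 + action + 1) % 20

# Shared design (covariance) matrix accumulating features from ALL agents.
Lambda = np.eye(d)

for _ in range(num_rounds):
    # One exploration policy per round: act greedily w.r.t. an assumed
    # elliptical uncertainty bonus phi^T Lambda^{-1} phi from shared data.
    Lambda_inv = np.linalg.inv(Lambda)

    def explore_policy(state, actions=range(4)):
        bonuses = [phi(state, a) @ Lambda_inv @ phi(state, a) for a in actions]
        return int(np.argmax(bonuses))

    # Every parallel agent runs the SAME policy; only their trajectories differ.
    new_features = []
    for _agent in range(P):
        state = int(rng.integers(20))
        for _h in range(H):
            action = explore_policy(state)
            new_features.append(phi(state, action))
            state = step(state, action)

    # Aggregate all agents' data into the shared covariance before the next round.
    for f in new_features:
        Lambda += np.outer(f, f)

# A crude measure of how well the pooled data covers the feature space.
print("log-det of aggregated covariance:", np.linalg.slogdet(Lambda)[1])
```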