While parallelism has been extensively used in Reinforcement Learning (RL), the quantitative effects of parallel exploration are not well understood theoretically. We study the benefits of simple parallel exploration for reward-free RL in linear Markov decision processes (MDPs) and two-player zero-sum Markov games (MGs). In contrast to the existing literature, which focuses on approaches that encourage agents to explore over a diverse set of policies, we show that using a single policy to guide exploration across all agents is sufficient to obtain an almost-linear speedup in all cases compared to their fully sequential counterparts. Furthermore, we show that this simple procedure is near-minimax optimal in the reward-free setting for linear MDPs. From a practical perspective, our paper shows that a single policy is sufficient and provably near-optimal for incorporating parallelism during the exploration phase.