Although parallelism has been used extensively in reinforcement learning (RL), the quantitative effects of parallel exploration are not well understood theoretically. We study the benefits of simple parallel exploration for reward-free RL in linear Markov decision processes (MDPs) and two-player zero-sum Markov games (MGs). In contrast to the existing literature, which focuses on approaches that encourage agents to explore a diverse set of policies, we show that using a single policy to guide exploration across all agents is sufficient to obtain an almost-linear speedup in all cases over the fully sequential counterpart. Furthermore, we demonstrate that this simple procedure is near-minimax optimal in the reward-free setting for linear MDPs. From a practical perspective, our paper shows that a single policy is sufficient and provably near-optimal for incorporating parallelism during the exploration phase.
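The following is a minimal illustrative sketch, not the paper's algorithm: every parallel agent collects reward-free trajectories by following the same shared exploration policy, and the data are pooled afterward. The environment class `ToyEnv`, the policy `shared_policy`, and the helper `parallel_explore` are hypothetical names introduced only for this example (here the shared policy is uniform random, whereas the paper's exploration policy would be computed by its algorithm).

```python
import random

class ToyEnv:
    """Placeholder environment standing in for a linear MDP (hypothetical)."""
    def __init__(self, num_states=5, num_actions=3, seed=0):
        self.num_states, self.num_actions = num_states, num_actions
        self.rng = random.Random(seed)
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        # Reward-free exploration: only the transition is recorded, no reward.
        self.state = self.rng.randrange(self.num_states)
        return self.state

def shared_policy(state, h, num_actions=3):
    # The single exploration policy shared by every agent
    # (uniform random here, purely for illustration).
    return random.randrange(num_actions)

def parallel_explore(envs, policy, horizon):
    """Each parallel agent rolls out the SAME policy; trajectories are pooled."""
    dataset = []
    for env in envs:                      # conceptually executed in parallel
        state, trajectory = env.reset(), []
        for h in range(horizon):
            action = policy(state, h)
            next_state = env.step(action)
            trajectory.append((state, action, next_state))
            state = next_state
        dataset.append(trajectory)
    return dataset

if __name__ == "__main__":
    agents = [ToyEnv(seed=i) for i in range(4)]   # 4 parallel agents
    data = parallel_explore(agents, shared_policy, horizon=10)
    print(len(data), "trajectories of length", len(data[0]))
```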