Several algorithms have been proposed to sample the replay buffer of deep Reinforcement Learning (RL) agents non-uniformly in order to speed up learning, but these sampling schemes have received little theoretical grounding. Among them, Prioritized Experience Replay remains a hyperparameter-sensitive heuristic, even though it can deliver good performance. In this work, we cast replay buffer sampling as an importance sampling problem for estimating the gradient. This allows us to derive the theoretically optimal sampling distribution, i.e. the one yielding the best theoretical convergence speed. Building on this ideal sampling scheme, we provide new theoretical foundations for Prioritized Experience Replay. Since the optimal sampling distribution is intractable, we introduce several approximations that perform well in practice, among them LaBER (Large Batch Experience Replay), an easy-to-code and efficient method for sampling the replay buffer. LaBER can be combined with Deep Q-Networks, distributional RL agents, or actor-critic methods, and yields improved performance over a diverse range of Atari games and PyBullet environments, compared to the base agent it is built on and to other prioritization schemes.
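As a pointer to the kind of result the abstract refers to, the following is a minimal sketch of the classical variance-minimizing importance-sampling distribution for a sum of per-transition gradients; the notation (N stored transitions, per-sample loss \ell_i, parameters \theta) is introduced here only to illustrate the claim, not to restate the paper's exact derivation.

% Sketch of the standard variance-minimizing importance-sampling distribution
% for the gradient estimate; the notation below is assumed for illustration.
\[
  \nabla_\theta \mathcal{L}(\theta) \;=\; \frac{1}{N} \sum_{i=1}^{N} \nabla_\theta \ell_i(\theta),
  \qquad
  p_i^\star \;\propto\; \bigl\| \nabla_\theta \ell_i(\theta) \bigr\|,
\]
\[
  \text{with the unbiased estimator}\quad
  \widehat{\nabla_\theta \mathcal{L}}(\theta) \;=\; \frac{1}{N\, p_I^\star} \, \nabla_\theta \ell_I(\theta),
  \qquad I \sim p^\star .
\]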
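Below is a minimal NumPy sketch of a large-batch-then-downsample replay sampling step in the spirit of LaBER. The abstract does not spell out the algorithm, so the specific batch sizes, the use of the absolute TD error as a surrogate priority, and the importance-weight correction shown here are illustrative assumptions rather than the paper's exact procedure.

# Minimal sketch of a large-batch-then-downsample replay sampling step in the
# spirit of LaBER. Batch sizes, the |TD error| surrogate priority, and the
# importance-weight correction are assumptions made for illustration.
import numpy as np

rng = np.random.default_rng(0)

def laber_style_sample(buffer_size, large_batch_size, mini_batch_size, td_error_fn):
    """Return a mini-batch of transition indices and importance weights.

    1. Draw a large batch uniformly from the replay buffer.
    2. Compute a cheap surrogate priority for each transition (here, |TD error|).
    3. Downsample the mini-batch with probabilities proportional to the priorities.
    4. Return weights that keep the gradient estimate unbiased with respect to
       the uniform average over the large batch.
    """
    # Step 1: uniform pre-sample of indices into the replay buffer.
    large_batch = rng.integers(0, buffer_size, size=large_batch_size)

    # Step 2: surrogate priorities, e.g. absolute TD errors of those transitions.
    priorities = np.abs(td_error_fn(large_batch)) + 1e-8  # avoid zero probabilities
    probs = priorities / priorities.sum()

    # Step 3: importance-sample the mini-batch from the large batch.
    chosen = rng.choice(large_batch_size, size=mini_batch_size, p=probs)
    mini_batch = large_batch[chosen]

    # Step 4: weights 1 / (m * p_i) so that the weighted gradient matches, in
    # expectation, the uniform average over the large batch.
    weights = 1.0 / (large_batch_size * probs[chosen])
    return mini_batch, weights

# Toy usage with a fake TD-error oracle standing in for the agent's critic.
fake_td = lambda idx: rng.normal(size=idx.shape)
indices, weights = laber_style_sample(buffer_size=100_000,
                                      large_batch_size=1024,
                                      mini_batch_size=32,
                                      td_error_fn=fake_td)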