The growth of deep reinforcement learning (RL) has brought multiple exciting tools and methods to the field. This rapid expansion makes it important to understand the interplay between individual elements of the RL toolbox. We approach this task from an empirical perspective by conducting a study in the continuous control setting. We present several insights of a fundamental nature, including: averaging multiple actors trained from the same data boosts performance; existing methods are unstable across training runs, epochs of training, and evaluation runs; the commonly used additive action noise is not required for effective training; a strategy based on posterior sampling explores better than approximated UCB combined with the weighted Bellman backup; the weighted Bellman backup alone cannot replace clipped double Q-learning; the critics' initialization plays the major role in ensemble-based actor-critic exploration. In conclusion, we show how existing tools can be brought together in a novel way, giving rise to the Ensemble Deep Deterministic Policy Gradients (ED2) method, which yields state-of-the-art results on continuous control tasks from OpenAI Gym MuJoCo. On the practical side, ED2 is conceptually straightforward, easy to code, and requires no knowledge outside of the existing RL toolbox.
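To make two of the abstract-level ideas concrete, the following is a minimal sketch (not the authors' code), assuming deterministic actors are represented as simple callables over NumPy state vectors: (1) evaluation by averaging the actions of an ensemble of actors, and (2) posterior-sampling-style exploration by following a single ensemble member, sampled uniformly at the start of each episode, without additive action noise. The linear toy policies and dimensions are illustrative assumptions only.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_actor(weights):
    """A toy deterministic actor: a linear policy mapping state -> action."""
    return lambda state: np.tanh(state @ weights)

STATE_DIM, ACTION_DIM, ENSEMBLE_SIZE = 4, 2, 5  # illustrative sizes
actors = [make_actor(rng.normal(size=(STATE_DIM, ACTION_DIM)))
          for _ in range(ENSEMBLE_SIZE)]

def evaluation_action(state, actors):
    # Evaluation: average the deterministic actions of all ensemble members.
    return np.mean([actor(state) for actor in actors], axis=0)

def start_episode(actors):
    # Exploration: sample one ensemble member and follow it for the whole
    # episode (a posterior-sampling-style strategy; no additive action noise).
    return actors[rng.integers(len(actors))]

state = rng.normal(size=STATE_DIM)
behaviour_actor = start_episode(actors)
print("exploration action:", behaviour_actor(state))
print("evaluation action :", evaluation_action(state, actors))
```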