Reincarnating Reinforcement Learning: Reusing Prior Computation to Accelerate Progress

Learning tabula rasa, that is, without any prior knowledge, is the prevalent workflow in reinforcement learning (RL) research. However, RL systems applied to large-scale settings rarely operate tabula rasa. Such systems undergo multiple design or algorithmic changes during their development cycle, and they rely on ad hoc approaches to incorporate these changes without re-training from scratch, which would be prohibitively expensive. Additionally, the inefficiency of deep RL typically excludes researchers without access to industrial-scale resources from tackling computationally demanding problems. To address these issues, we present reincarnating RL as an alternative workflow or class of problem settings, where prior computational work (e.g., learned policies) is reused or transferred between design iterations of an RL agent, or from one RL agent to another. As a step towards enabling reincarnating RL from any agent to any other agent, we focus on the specific setting of efficiently transferring an existing sub-optimal policy to a standalone value-based RL agent. We find that existing approaches fail in this setting and propose a simple algorithm to address their limitations. Equipped with this algorithm, we demonstrate reincarnating RL's gains over tabula rasa RL on Atari 2600 games, a challenging locomotion task, and the real-world problem of navigating stratospheric balloons. Overall, this work argues for an alternative approach to RL research, which we believe could significantly improve real-world RL adoption and help democratize it further. Open-source code and trained agents are available at https://agarwl.github.io/reincarnating_rl.
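To make the policy-to-value transfer setting more concrete, here is a minimal sketch of one way a value-based student could reuse a sub-optimal teacher policy: a standard temporal-difference loss is combined with a distillation term that pulls the student's softmax-over-Q policy towards the teacher. This is an illustrative assumption and is not taken from the abstract; the function name `reincarnation_loss`, the `distill_weight` parameter, and the softmax temperature are hypothetical choices, and the abstract does not specify the form of the proposed algorithm.

```python
import numpy as np


def softmax(x, temperature=1.0):
    """Numerically stable softmax over the last axis."""
    z = (x - x.max(axis=-1, keepdims=True)) / temperature
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def reincarnation_loss(q, q_next, actions, rewards, dones, teacher_probs,
                       distill_weight, gamma=0.99, temperature=1.0):
    """Hypothetical combined loss: TD error plus teacher distillation.

    q, q_next:      (batch, num_actions) student Q-values at s and s'
    actions:        (batch,) integer actions taken at s
    rewards, dones: (batch,) float arrays (dones is 1.0 at episode end)
    teacher_probs:  (batch, num_actions) teacher action probabilities at s
    distill_weight: scalar; an assumption is that it would be decayed
                    towards 0 so the student ends up standalone
    """
    batch = np.arange(len(actions))

    # One-step Q-learning target and squared TD error.
    td_target = rewards + gamma * (1.0 - dones) * q_next.max(axis=1)
    td_loss = np.mean((q[batch, actions] - td_target) ** 2)

    # Cross-entropy between the teacher policy and softmax(Q / temperature).
    student_probs = softmax(q, temperature)
    distill_loss = -np.mean(
        np.sum(teacher_probs * np.log(student_probs + 1e-8), axis=1)
    )

    return td_loss + distill_weight * distill_loss
```

With `distill_weight` set to 0 the function reduces to an ordinary Q-learning loss, which corresponds to the tabula rasa baseline; a positive weight adds the teacher's guidance on top of it.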
