Training deep reinforcement learning (DRL) models usually requires high computation costs. Therefore, compressing DRL models holds great potential for accelerating training and easing model deployment. However, existing methods for producing small models mainly adopt knowledge distillation, which iteratively trains a dense network, so the training process still demands massive computing resources. Indeed, sparse training from scratch in DRL has not been well explored and is particularly challenging due to the non-stationarity in bootstrap training. In this work, we propose a novel sparse DRL training framework, "the Rigged Reinforcement Learning Lottery" (RLx2), which builds upon gradient-based topology evolution and is capable of training a DRL agent entirely with a sparse network. Specifically, RLx2 introduces a novel multi-step TD target mechanism together with a dynamic-capacity replay buffer to achieve robust value learning and efficient topology exploration in sparse models. RLx2 reaches state-of-the-art sparse training performance on several tasks, showing $7.5\times$-$20\times$ model compression with less than 3% performance degradation, and up to $20\times$ and $50\times$ FLOPs reduction for training and inference, respectively.
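To make the two mechanisms named above more concrete, the sketch below illustrates, under our own simplifying assumptions rather than as the RLx2 reference implementation, (i) a RigL-style gradient-based topology update that drops the smallest-magnitude active weights and regrows the same number of connections where the gradient magnitude is largest, and (ii) a multi-step (N-step) TD target computed by discounting rewards back from a bootstrapped target-network value. All names (`topology_update`, `n_step_td_target`, `drop_frac`) are illustrative.

```python
# Illustrative sketch only; not the authors' implementation.
import torch


def topology_update(weight, grad, mask, drop_frac=0.1):
    """RigL-style update: drop the smallest-magnitude active weights,
    then regrow the same number of inactive connections with the
    largest gradient magnitude, keeping total sparsity fixed."""
    n_active = int(mask.sum().item())
    n_change = int(drop_frac * n_active)
    if n_change == 0:
        return mask

    # Drop: among active weights, deactivate the n_change smallest |w|.
    w_mag = torch.where(mask.bool(), weight.abs(),
                        torch.full_like(weight, float("inf")))
    drop_idx = torch.topk(w_mag.flatten(), n_change, largest=False).indices
    new_mask = mask.clone().flatten()
    new_mask[drop_idx] = 0.0

    # Grow: among inactive weights, activate the n_change largest |grad|.
    g_mag = torch.where(new_mask.bool(),
                        torch.full_like(grad.flatten(), -float("inf")),
                        grad.abs().flatten())
    grow_idx = torch.topk(g_mag, n_change, largest=True).indices
    new_mask[grow_idx] = 1.0
    return new_mask.view_as(mask)


def n_step_td_target(rewards, bootstrap_value, dones, gamma=0.99):
    """Multi-step TD target: discounted sum of N rewards plus a
    discounted bootstrap value from the target network."""
    target = bootstrap_value
    for r, d in zip(reversed(rewards), reversed(dones)):
        target = r + gamma * (1.0 - d) * target
    return target
```

In a sparse-training loop, `topology_update` would be applied periodically to each layer's binary mask so that connectivity evolves while the overall parameter count stays fixed, and `n_step_td_target` would replace the usual one-step bootstrap when computing the critic's regression target.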