Deep reinforcement learning has achieved significant milestones; however, the computational demands of reinforcement learning training and inference remain substantial. Quantization is an effective method for reducing the computational overhead of neural networks, but in the context of reinforcement learning it is unknown whether quantization's computational benefits outweigh the accuracy costs introduced by the corresponding quantization error. To quantify this tradeoff, we perform a broad study applying quantization to reinforcement learning. We apply standard quantization techniques such as post-training quantization (PTQ) and quantization-aware training (QAT) to a comprehensive set of reinforcement learning tasks (Atari, Gym), algorithms (A2C, DDPG, DQN, D4PG, PPO), and models (MLPs, CNNs), and show that policies can be quantized to 8 bits without degrading reward, enabling significant inference speedups on resource-constrained edge devices. Motivated by the effectiveness of standard quantization techniques on reinforcement learning policies, we introduce a novel quantization algorithm, \textit{ActorQ}, for quantized actor-learner distributed reinforcement learning training. By leveraging full-precision optimization on the learner and quantized execution on the actors, \textit{ActorQ} enables 8-bit inference while maintaining convergence. We build a system for quantized reinforcement learning training around \textit{ActorQ} and demonstrate end-to-end speedups of $>$1.5$\times$--2.5$\times$ over full-precision training on a range of tasks (DeepMind Control Suite). Finally, we break down the various runtime costs of distributed reinforcement learning training (e.g., communication time, inference time, and model load time) and evaluate the effects of quantization on these system attributes.
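To illustrate the kind of post-training quantization discussed above, the following is a minimal sketch of uniform symmetric 8-bit PTQ for a single weight tensor. This is a generic scheme shown for intuition (the scale choice and clipping here are assumptions, not necessarily the exact configuration used in the study):

```python
import numpy as np

def quantize_int8(w):
    """Uniform symmetric post-training quantization of a float tensor to int8.

    The largest weight magnitude is mapped to 127, so the rounding error
    per weight is bounded by scale / 2.
    """
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    """Recover an approximate float tensor from int8 values and a scale."""
    return q.astype(np.float32) * scale

# Example: quantize a random weight matrix and measure the worst-case error.
rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)
max_err = np.abs(w - w_hat).max()
```

The reward results in the abstract correspond to applying this sort of transformation to a trained policy's weights (and, for QAT, simulating it during training); the per-weight error stays within half a quantization step, which is what makes 8-bit policies viable.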