Low-precision training has become a popular approach to reduce compute requirements, memory footprint, and energy consumption in supervised learning. In contrast, this promising approach has not yet enjoyed similarly widespread adoption within the reinforcement learning (RL) community, partly because RL agents can be notoriously hard to train even in full precision. In this paper we consider continuous control with the state-of-the-art SAC agent and demonstrate that a na\"ive adaptation of low-precision methods from supervised learning fails. We propose six modifications, all straightforward to implement, that leave the underlying agent and its hyperparameters unchanged while dramatically improving numerical stability. The resulting modified SAC agent has lower memory and compute requirements while matching full-precision rewards, demonstrating that low-precision training can substantially accelerate state-of-the-art RL without parameter tuning.
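To give intuition for why a na\"ive switch to low precision can fail, the sketch below (an illustration of a well-known half-precision pitfall, not code from the paper) accumulates many small gradient-sized increments. In float16 the running sum stalls once its magnitude makes each increment smaller than half the floating-point spacing, whereas float32 accumulates correctly; this kind of rounding-to-zero in repeated updates is one reason numerically stable modifications are needed.

```python
import numpy as np

# Hypothetical illustration: repeatedly add a small increment (1e-4),
# as happens with gradient updates or running statistics.
n = 200_000
step32 = np.float32(1e-4)
step16 = np.float16(1e-4)

fp32_sum = np.float32(0.0)
fp16_sum = np.float16(0.0)
for _ in range(n):
    fp32_sum = np.float32(fp32_sum + step32)
    # Once fp16_sum reaches ~0.25, the fp16 spacing (2**-12) exceeds
    # 2 * step16, so the addition rounds back to fp16_sum and stalls.
    fp16_sum = np.float16(fp16_sum + step16)

print(float(fp32_sum))  # close to the true value 20.0
print(float(fp16_sum))  # far below 20.0: accumulation has stalled
```

Remedies common in supervised mixed-precision training (keeping master weights and accumulators in float32, loss scaling) target exactly this failure mode; the paper's contribution is a set of analogous fixes that make them work for SAC.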