Reinforcement learning algorithms are highly sensitive to the choice of hyperparameters, typically requiring significant manual effort to identify hyperparameters that perform well on a new domain. In this paper, we take a step towards addressing this issue by using meta-gradient descent (Xu et al., 2018) to automatically adapt hyperparameters online. We apply our algorithm, Self-Tuning Actor-Critic (STAC), to self-tune all the differentiable hyperparameters of an actor-critic loss function, to discover auxiliary tasks, and to improve off-policy learning using a novel leaky V-trace operator. STAC is simple to use, sample efficient, and does not require a significant increase in compute. Ablation studies show that the overall performance of STAC improves as more hyperparameters are adapted. When applied to the Arcade Learning Environment (Bellemare et al., 2012), STAC improved the median human-normalized score in 200M steps from 243% to 364%. When applied to the DM Control suite (Tassa et al., 2018), STAC improved the mean score in 30M steps from 217 to 389 when learning with features, from 108 to 202 when learning from pixels, and from 195 to 295 in the Real-World Reinforcement Learning Challenge (Dulac-Arnold et al., 2020).
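The core mechanism, adapting a differentiable hyperparameter online by differentiating an outer (meta) objective through an inner parameter update, can be illustrated with a minimal sketch. The example below is a hypothetical JAX illustration, not the paper's implementation: the names `inner_loss`, `meta_loss`, and `eta_logit`, and the toy regression objective, are assumptions chosen only to show how the meta-gradient flows through one inner step.

```python
# Minimal sketch (assumed, illustrative only) of online meta-gradient
# hyperparameter adaptation. A differentiable hyperparameter eta
# (here a loss coefficient, squashed through a sigmoid) scales the
# inner loss; the meta-gradient is obtained by differentiating the
# outer objective through the inner parameter update.
import jax
import jax.numpy as jnp

def inner_loss(theta, eta_logit, batch):
    # Toy stand-in for the actor-critic loss; the coefficient is the
    # self-tuned hyperparameter, kept differentiable via a sigmoid.
    coeff = jax.nn.sigmoid(eta_logit)
    return coeff * jnp.mean((batch["target"] - batch["x"] @ theta) ** 2)

def meta_loss(eta_logit, theta, batch, lr=1e-2):
    # Inner step: update theta using the eta-parameterized loss.
    grads = jax.grad(inner_loss)(theta, eta_logit, batch)
    theta_new = theta - lr * grads
    # Outer (meta) objective: performance of the updated parameters,
    # here measured with a fixed (unscaled) loss.
    return jnp.mean((batch["target"] - batch["x"] @ theta_new) ** 2)

# Meta-gradient of the outer objective w.r.t. the hyperparameter.
meta_grad_fn = jax.grad(meta_loss)

key = jax.random.PRNGKey(0)
theta = jax.random.normal(key, (4,))
eta_logit = jnp.array(0.0)
batch = {"x": jax.random.normal(key, (8, 4)),
         "target": jax.random.normal(key, (8,))}

# One online meta-update of the hyperparameter.
eta_logit = eta_logit - 1e-1 * meta_grad_fn(eta_logit, theta, batch)
```

In STAC the same idea is applied jointly to all differentiable hyperparameters of the actor-critic loss; the sketch above only shows the single-coefficient case for clarity.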