Reusing previously trained models is critical in deep reinforcement learning to speed up the training of new agents. However, it is unclear how to acquire new skills when objectives and constraints conflict with previously learned skills. Moreover, during retraining there is an intrinsic conflict between exploiting what has already been learned and exploring new skills. In soft actor-critic (SAC) methods, a temperature parameter can be dynamically adjusted to weight the action entropy and balance the exploration-exploitation trade-off. However, controlling a single coefficient can be challenging in the context of retraining, even more so when goals are contradictory. In this work, inspired by neuroscience research, we propose a novel approach that uses inhibitory networks to allow separate, adaptive state-value evaluations, as well as distinct automatic entropy tuning. Ultimately, our approach makes it possible to control inhibition so as to handle the conflict between exploiting less risky, already acquired behaviors and exploring novel ones to overcome more challenging tasks. We validate our method through experiments in OpenAI Gym environments.
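For reference, the temperature-weighted entropy term mentioned above follows the standard maximum-entropy SAC formulation with automatic temperature tuning (Haarnoja et al.); the equations below restate that background only, not the inhibitory-network extension proposed here, where $\alpha$ denotes the temperature and $\bar{\mathcal{H}}$ a fixed target entropy:
$$J(\pi) = \sum_{t} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\big[\, r(s_t, a_t) + \alpha\, \mathcal{H}\big(\pi(\cdot \mid s_t)\big) \,\big], \qquad J(\alpha) = \mathbb{E}_{a_t \sim \pi_t}\big[\, -\alpha \log \pi_t(a_t \mid s_t) - \alpha\, \bar{\mathcal{H}} \,\big].$$
Minimizing $J(\alpha)$ increases $\alpha$ when the policy entropy falls below the target $\bar{\mathcal{H}}$ (encouraging exploration) and decreases it otherwise; our method builds on this mechanism by tuning separate temperatures.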