Intrinsic rewards can improve exploration in reinforcement learning, but the exploration process may suffer from instability caused by non-stationary reward shaping and strong dependency on hyperparameters. In this work, we introduce Decoupled RL (DeRL) as a general framework that trains separate policies for intrinsically-motivated exploration and exploitation. Such decoupling allows DeRL to leverage the benefits of intrinsic rewards for exploration while improving robustness and sample efficiency. We evaluate DeRL algorithms in two sparse-reward environments with multiple types of intrinsic rewards. Our results show that DeRL is more robust to varying scale and rate of decay of intrinsic rewards and converges to the same evaluation returns as intrinsically-motivated baselines in fewer interactions. Lastly, we discuss the challenge of distribution shift and show that divergence constraint regularisers can successfully minimise instability caused by divergence of the exploration and exploitation policies.
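To make the decoupling concrete, the sketch below illustrates the core idea under simplifying assumptions that are not part of the paper: a toy sparse-reward chain environment, a count-based novelty bonus standing in for the intrinsic reward, and tabular Q-learning standing in for both policies. The exploration learner is updated on the extrinsic reward shaped by the intrinsic bonus and generates all behaviour, while the exploitation learner is updated from the same transitions using only the extrinsic reward; the paper's divergence constraint regularisers are omitted here.

```python
# Minimal sketch of decoupled exploration/exploitation training.
# Assumptions (not the paper's implementation): toy chain environment,
# count-based intrinsic reward, tabular Q-learning for both policies.
import random
from collections import defaultdict

N_STATES, GOAL = 12, 11          # chain of states, sparse reward only at the end
ACTIONS = [-1, +1]               # move left / right
ALPHA, GAMMA, EPS, BETA = 0.1, 0.99, 0.2, 0.5

q_explore = defaultdict(float)   # trained on extrinsic + intrinsic reward
q_exploit = defaultdict(float)   # trained on extrinsic reward only
visits = defaultdict(int)        # state visitation counts for the intrinsic bonus

def step(s, move):
    s_next = min(max(s + move, 0), N_STATES - 1)
    r_ext = 1.0 if s_next == GOAL else 0.0
    return s_next, r_ext, s_next == GOAL

def greedy(q, s):
    return max(range(len(ACTIONS)), key=lambda a: q[(s, a)])

def td_update(q, s, a, r, s_next, done):
    target = r + (0.0 if done else GAMMA * max(q[(s_next, b)] for b in range(len(ACTIONS))))
    q[(s, a)] += ALPHA * (target - q[(s, a)])

for episode in range(500):
    s = 0
    for _ in range(4 * N_STATES):
        # Behaviour comes from the exploration policy (epsilon-greedy here).
        a = random.randrange(len(ACTIONS)) if random.random() < EPS else greedy(q_explore, s)
        s_next, r_ext, done = step(s, ACTIONS[a])
        visits[s_next] += 1
        r_int = 1.0 / visits[s_next] ** 0.5     # count-based novelty bonus

        # Decoupled updates from the same transition:
        td_update(q_explore, s, a, r_ext + BETA * r_int, s_next, done)  # shaped reward
        td_update(q_exploit, s, a, r_ext, s_next, done)                 # extrinsic only
        s = s_next
        if done:
            break

# Evaluation uses the exploitation policy, which never saw intrinsic rewards.
s, ret = 0, 0.0
for _ in range(4 * N_STATES):
    s, r, done = step(s, ACTIONS[greedy(q_exploit, s)])
    ret += r
    if done:
        break
print("evaluation return of exploitation policy:", ret)
```

Because the exploitation learner never sees the intrinsic bonus, its evaluation returns are unaffected by the scale or decay of the shaping term, which is the robustness property the abstract highlights.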