In real-world scenarios, reinforcement learning under sparse-reward synergistic settings remains challenging, despite surging interest in this field. Previous attempts suggest that intrinsic rewards can alleviate the issues caused by reward sparsity. In this paper, we present a novel intrinsic reward inspired by human learning, as humans evaluate curiosity by comparing current observations with historical knowledge. Specifically, we train a self-supervised prediction model and save a set of snapshots of the model parameters, without incurring additional training cost. We then employ the nuclear norm to evaluate the temporal inconsistency between the predictions of different snapshots, which can be further deployed as the intrinsic reward. Moreover, a variational weighting mechanism is proposed to assign weights to different snapshots in an adaptive manner. We demonstrate the efficacy of the proposed method in various benchmark environments. The results suggest that our method delivers state-of-the-art performance compared with other intrinsic reward-based methods, without incurring additional training costs, while maintaining higher noise tolerance. Our code will be released publicly to enhance reproducibility.
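The sketch below is a minimal, hedged illustration (not the authors' implementation) of the core idea stated above: stack the predictions of several saved parameter snapshots of the self-supervised prediction model and use the nuclear norm of that matrix as the intrinsic reward. All names here (`intrinsic_reward`, the array shapes) are hypothetical, and the adaptive variational weighting of snapshots is omitted for brevity.

```python
import numpy as np

def intrinsic_reward(snapshot_predictions: np.ndarray) -> float:
    """Compute a nuclear-norm-based intrinsic reward.

    snapshot_predictions: array of shape (K, D), one D-dimensional prediction
    per saved snapshot of the prediction model for the current observation.
    """
    # Nuclear norm = sum of singular values of the stacked-prediction matrix;
    # a larger value indicates greater temporal inconsistency between snapshots.
    singular_values = np.linalg.svd(snapshot_predictions, compute_uv=False)
    return float(singular_values.sum())

# Usage: K = 4 snapshots, each predicting a 16-dimensional next-state embedding.
preds = np.random.randn(4, 16)
r_int = intrinsic_reward(preds)
```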