Exploration is a fundamental aspect of reinforcement learning (RL), and its effectiveness largely determines the performance of RL algorithms, especially when extrinsic rewards are sparse. Recent studies have shown the effectiveness of encouraging exploration with intrinsic rewards estimated from the novelty of observations. However, there is a gap between the novelty of an observation and exploratory behavior in general, because both the stochasticity in the environment and the agent's own behavior can affect the observation. To estimate exploratory behaviors accurately, we propose DEIR, a novel method in which we theoretically derive an intrinsic reward from a conditional mutual information term that scales principally with the novelty contributed by the agent's exploration, and realize the reward with a discriminative forward model. Extensive experiments in both standard and hardened exploration games in MiniGrid show that DEIR quickly learns a better policy than the baselines. Our evaluations in ProcGen demonstrate both the generalization capability and the general applicability of our intrinsic reward.
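To make the idea of a discriminative-model-based intrinsic reward concrete, the following is a minimal, hypothetical sketch in PyTorch. The class `DiscriminativeForwardModel`, the function `intrinsic_reward`, the `episodic_memory` buffer, and the specific reward rule (episodic novelty distance weighted by the discriminator's confidence) are illustrative assumptions for exposition, not the derivation or implementation given in the paper.

```python
# Hypothetical sketch: a discriminative forward model scores how plausible a
# transition (obs, action, next_obs) is, and an episodic novelty term rewards
# observations that are far from anything seen so far in the episode.
import torch
import torch.nn as nn


class DiscriminativeForwardModel(nn.Module):
    """Scores how likely `next_obs` is the true successor of (`obs`, `action`)."""

    def __init__(self, obs_dim: int, act_dim: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim * 2 + act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs, action, next_obs):
        return self.net(torch.cat([obs, action, next_obs], dim=-1)).squeeze(-1)


def intrinsic_reward(model, obs, action, next_obs, episodic_memory):
    """Illustrative episodic bonus: distance of the new observation to the
    episodic memory (a novelty proxy), weighted by the discriminator's
    confidence that the transition was genuinely produced by the agent."""
    with torch.no_grad():
        novelty = torch.cdist(next_obs.unsqueeze(0), episodic_memory).min()
        confidence = torch.sigmoid(model(obs, action, next_obs))
    return (novelty * confidence).item()
```

In this sketch, the confidence factor is meant to suppress novelty that the agent did not itself cause (e.g., environment stochasticity), mirroring the motivation stated above; the exact form used by DEIR is derived from the conditional mutual information term in the paper.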