Efficient exploration in reinforcement learning is a challenging problem commonly addressed through intrinsic rewards. Recent prominent approaches are based on state novelty or variants of artificial curiosity. However, directly applying them to partially observable environments can be ineffective and lead to premature dissipation of intrinsic rewards. Here we propose random curiosity with general value functions (RC-GVF), a novel intrinsic reward function that draws upon connections between these distinct approaches. Instead of using only the current observation's novelty or a curiosity bonus for failing to predict precise environment dynamics, RC-GVF derives intrinsic rewards through predicting temporally extended general value functions. We demonstrate that this improves exploration in a hard-exploration diabolical lock problem. Furthermore, RC-GVF significantly outperforms previous methods in the absence of ground-truth episodic counts in the partially observable MiniGrid environments. Panoramic observations on MiniGrid further boost RC-GVF's performance, making it competitive with baselines that exploit privileged information in the form of episodic counts.
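To make the core mechanism concrete, the following is a minimal sketch of one plausible instantiation in PyTorch. It is an illustrative assumption, not the paper's exact architecture: a fixed, randomly initialised network defines pseudo-reward cumulants, a learned predictor estimates their discounted sum (a general value function), and the temporal-difference error of that prediction is used as the intrinsic reward. All names, network sizes, and the discount factor below are hypothetical.

```python
import torch
import torch.nn as nn

OBS_DIM, CUMULANT_DIM, GAMMA = 64, 8, 0.9  # illustrative sizes and discount

# Fixed, randomly initialised network defining the cumulants (pseudo-rewards).
random_cumulant_net = nn.Sequential(nn.Linear(OBS_DIM, 128), nn.ReLU(),
                                    nn.Linear(128, CUMULANT_DIM))
for p in random_cumulant_net.parameters():
    p.requires_grad_(False)

# Learned predictor of the discounted sum of future cumulants (the GVF).
gvf_predictor = nn.Sequential(nn.Linear(OBS_DIM, 128), nn.ReLU(),
                              nn.Linear(128, CUMULANT_DIM))
optimizer = torch.optim.Adam(gvf_predictor.parameters(), lr=1e-4)


def intrinsic_reward(obs, next_obs):
    """One-step TD error of the GVF prediction, used as the exploration bonus."""
    with torch.no_grad():
        cumulant = random_cumulant_net(next_obs)          # pseudo-reward c_{t+1}
        bootstrap = gvf_predictor(next_obs)               # v(o_{t+1})
        target = cumulant + GAMMA * bootstrap             # TD target
    prediction = gvf_predictor(obs)                       # v(o_t)
    td_error = (prediction - target).pow(2).mean(dim=-1)  # per-transition error

    # The predictor is trained on the same error; the agent receives its
    # detached value as the intrinsic reward.
    optimizer.zero_grad()
    td_error.mean().backward()
    optimizer.step()
    return td_error.detach()


# Usage on a dummy batch of transitions.
obs, next_obs = torch.randn(32, OBS_DIM), torch.randn(32, OBS_DIM)
print(intrinsic_reward(obs, next_obs).shape)  # torch.Size([32])
```

In this reading, the reward decays only once the predictor has learned the temporally extended statistics of a region, rather than after a single novel observation, which is one way to interpret the claimed robustness to premature dissipation under partial observability.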