Self-imitation learning is a Reinforcement Learning (RL) method that encourages actions whose returns were higher than expected, which helps in hard-exploration and sparse-reward problems. It was shown to improve the performance of on-policy actor-critic methods in several discrete control tasks. Nevertheless, applying self-imitation to the largely action-value-based off-policy RL methods is not straightforward. We propose SAIL, a novel generalization of self-imitation learning for off-policy RL, based on a modification of the Bellman optimality operator that we connect to Advantage Learning. Crucially, our method mitigates the problem of stale returns by choosing the more optimistic of the observed return and the current action-value as the estimate used for self-imitation. We demonstrate the empirical effectiveness of SAIL on the Arcade Learning Environment, with a focus on hard-exploration games.
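To make the mechanism concrete, the following is a minimal PyTorch-style sketch of the optimistic-return idea described above: the target combines a standard Bellman optimality backup with an advantage-learning-style bonus built from max(observed return, current Q-value). The network names, the `mc_return` field, and the `alpha` coefficient are illustrative assumptions for this sketch, not the paper's exact operator.

```python
import torch


def sail_style_target(target_net, batch, gamma=0.99, alpha=0.9):
    """Illustrative DQN-style target with a self-imitation bonus.

    `batch` is assumed to hold transitions (s, a, r, s', done) together with
    the observed (possibly stale) Monte Carlo return `mc_return` for each
    (s, a). All names here are hypothetical and for illustration only.
    """
    s, a, r, s_next, done, mc_return = batch

    with torch.no_grad():
        # Standard Bellman optimality target: r + gamma * max_a' Q_target(s', a').
        q_next = target_net(s_next).max(dim=1).values
        bellman_target = r + gamma * (1.0 - done) * q_next

        # Current estimates used for the advantage-learning-style bonus.
        q_all = target_net(s)
        q_sa = q_all.gather(1, a.unsqueeze(1)).squeeze(1)
        v_s = q_all.max(dim=1).values

        # Optimistic return: a stale observed return is never trusted below
        # the current action-value estimate.
        g_optimistic = torch.maximum(mc_return, q_sa)

        # Self-imitation bonus shaped like an advantage term.
        sail_bonus = alpha * (g_optimistic - v_s)

    return bellman_target + sail_bonus
```

In this sketch, taking the element-wise maximum between the stored return and the current Q-value is what shields the update from stale, pessimistic returns, which is the core issue the abstract highlights for off-policy self-imitation.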