Most reinforcement learning algorithms rely on an experience replay buffer to repeatedly train on samples the agent has observed in the past. Not all samples carry the same significance, and simply assigning equal importance to each of them is a na\"ive strategy. In this paper, we propose a method to prioritize samples based on how much we can learn from them. We define the learn-ability of a sample as the steady decrease over time of the training loss associated with that sample. We develop an algorithm that prioritizes samples with high learn-ability, while assigning lower priority to those that are hard to learn, typically because of noise or stochasticity. We empirically show that our method is more robust than uniform random sampling and also outperforms prioritizing solely by the training loss, i.e., the temporal difference loss used in prioritized experience replay.
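As a rough illustration of the idea described above, the sketch below tracks the recent per-sample training losses in a replay buffer and turns the average loss decrease into a sampling priority, so that samples whose loss keeps shrinking are replayed more often while noisy, non-improving samples fall back to a small floor priority. The class name, the averaging window, and all method signatures are illustrative assumptions, not the paper's actual algorithm.

```python
import random
from collections import deque


class LearnabilityBuffer:
    """Minimal sketch of a replay buffer that prioritizes by loss decrease."""

    def __init__(self, capacity, history_len=5, eps=1e-3):
        self.capacity = capacity
        self.history_len = history_len  # recent losses kept per sample
        self.eps = eps                  # floor so no sample gets zero priority
        self.data = []                  # stored transitions
        self.loss_hist = []             # per-sample deque of recent losses

    def add(self, transition):
        if len(self.data) >= self.capacity:
            self.data.pop(0)
            self.loss_hist.pop(0)
        self.data.append(transition)
        self.loss_hist.append(deque(maxlen=self.history_len))

    def _priority(self, hist):
        # Learn-ability proxy: average drop of the loss between consecutive
        # replays. Samples whose loss does not decrease (e.g. due to noise)
        # end up with a priority close to eps.
        if len(hist) < 2:
            return 1.0  # unseen samples are sampled optimistically
        drops = [hist[i] - hist[i + 1] for i in range(len(hist) - 1)]
        return max(sum(drops) / len(drops), 0.0) + self.eps

    def sample(self, batch_size):
        prios = [self._priority(h) for h in self.loss_hist]
        total = sum(prios)
        probs = [p / total for p in prios]
        idx = random.choices(range(len(self.data)), weights=probs, k=batch_size)
        return idx, [self.data[i] for i in idx]

    def update_losses(self, idx, losses):
        # Called after a training step with the new per-sample losses.
        for i, loss in zip(idx, losses):
            self.loss_hist[i].append(float(loss))
```

In this sketch the priority depends on how much the loss has been decreasing rather than on its current magnitude, which is the contrast with loss-magnitude prioritization drawn in the abstract.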