Data selection is essential for any data-driven optimization technique, such as Reinforcement Learning. State-of-the-art sampling strategies for the experience replay buffer improve the performance of the Reinforcement Learning agent. However, they do not incorporate the uncertainty of the Q-Value estimate. Consequently, they cannot adapt their sampling, i.e., the balance between exploring and exploiting transitions, to the complexity of the task. To address this, this paper proposes a new sampling strategy that leverages the exploration-exploitation trade-off. It is enabled by an uncertainty estimate of the Q-Value function, which guides the sampling toward more significant transitions and thus yields a more efficient policy. Experiments on classical control environments demonstrate stable results across environments and show that, for dense rewards, the proposed method outperforms state-of-the-art sampling strategies in convergence and peak performance by 26% on average.
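To make the idea of uncertainty-guided replay sampling concrete, the following is a minimal sketch, not the paper's actual method: it assumes the Q-Value uncertainty of a transition is approximated by the disagreement (standard deviation) of an ensemble of Q-estimates, and that sampling probabilities follow a softmax over these scores. The class name, the temperature parameter, and the ensemble-based uncertainty proxy are illustrative assumptions.

```python
import numpy as np


class UncertaintyReplayBuffer:
    """Illustrative replay buffer that samples transitions in proportion to the
    disagreement of an ensemble of Q-Value estimates (an assumed uncertainty proxy).
    Higher disagreement -> more uncertain transition -> sampled more often."""

    def __init__(self, capacity, temperature=1.0, rng=None):
        self.capacity = capacity
        self.temperature = temperature      # controls how strongly sampling favors uncertain transitions
        self.transitions = []               # stored (s, a, r, s_next, done) tuples
        self.uncertainties = []             # per-transition uncertainty scores
        self.rng = rng or np.random.default_rng()

    def add(self, transition, q_ensemble_estimates):
        # q_ensemble_estimates: Q-Values for the stored (s, a) from each ensemble head.
        score = float(np.std(q_ensemble_estimates))
        if len(self.transitions) >= self.capacity:
            # Drop the oldest transition when the buffer is full.
            self.transitions.pop(0)
            self.uncertainties.pop(0)
        self.transitions.append(transition)
        self.uncertainties.append(score)

    def sample(self, batch_size):
        # Softmax over uncertainty scores: uncertain transitions are explored more,
        # well-understood ones are exploited less frequently.
        scores = np.asarray(self.uncertainties) / self.temperature
        probs = np.exp(scores - scores.max())
        probs /= probs.sum()
        idx = self.rng.choice(len(self.transitions), size=batch_size, p=probs)
        return [self.transitions[i] for i in idx]
```

In this sketch, lowering the assumed temperature concentrates sampling on the most uncertain transitions (more exploration of the replay data), while raising it approaches uniform sampling.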