Data selection is essential for any data-driven optimization technique, such as Reinforcement Learning. State-of-the-art sampling strategies for the experience replay buffer improve the performance of the Reinforcement Learning agent. However, they do not incorporate uncertainty in the Q-Value estimation and therefore cannot adapt their sampling strategy, i.e., the balance between exploring and exploiting transitions, to the complexity of the task. To address this, this paper proposes a new sampling strategy that leverages the exploration-exploitation trade-off. It is enabled by an uncertainty estimate of the Q-Value function, which guides the sampling toward more informative transitions and thus toward a more efficient policy. Experiments on classical control environments demonstrate stable results across environments and show that the proposed method outperforms state-of-the-art sampling strategies for dense rewards with respect to convergence and peak performance by 26% on average.
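A minimal sketch of what such an uncertainty-guided replay priority could look like, assuming the Q-Value uncertainty is estimated by an ensemble of Q-functions and the exploitation signal is the absolute TD error; the buffer layout, the mixing coefficient `beta`, and the tabular Q-tables are illustrative stand-ins, not the paper's implementation:

```python
# Hypothetical sketch: uncertainty-guided prioritized sampling from a replay buffer.
# Priority mixes an exploitation signal (|TD error|) with an exploration signal
# (ensemble disagreement). All names and the coefficient `beta` are assumptions.
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions, ensemble_size = 10, 2, 5
gamma, beta = 0.99, 1.0  # discount factor, exploration weight (assumed)

# Ensemble of independently initialized Q-tables as a stand-in for Q-networks.
q_ensemble = rng.normal(size=(ensemble_size, n_states, n_actions))

# Replay buffer of transitions (s, a, r, s').
buffer = [(rng.integers(n_states), rng.integers(n_actions),
           rng.normal(), rng.integers(n_states)) for _ in range(200)]

def priority(transition):
    s, a, r, s_next = transition
    q_sa = q_ensemble[:, s, a]                        # per-member Q(s, a)
    td_error = r + gamma * q_ensemble[:, s_next].max(axis=1) - q_sa
    exploitation = np.abs(td_error).mean()            # how wrong the current estimate is
    exploration = q_sa.std()                          # ensemble disagreement as uncertainty
    return exploitation + beta * exploration

# Sample a minibatch with probability proportional to priority.
p = np.array([priority(t) for t in buffer])
p = p / p.sum()
batch_idx = rng.choice(len(buffer), size=32, p=p, replace=False)
batch = [buffer[i] for i in batch_idx]
```

In this sketch, a large `beta` biases sampling toward transitions the ensemble disagrees on (exploration of uncertain regions), while a small `beta` recovers a TD-error-based prioritization (exploitation), mirroring the trade-off described above.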