This paper proposes a reinforcement learning framework to enhance the exploration-exploitation trade-off by learning a range of policies conditioned on various confidence bounds. Underestimated values provide stable updates but lead to inefficient exploration; overestimated values, on the other hand, can help the agent escape local optima, but may cause over-exploration of low-value regions and accumulation of function approximation errors. Algorithms have been proposed to mitigate this contradiction. However, we still lack an understanding of how value estimation bias impacts performance, as well as a method for efficient exploration that keeps value estimates free of catastrophic overestimation bias accumulation. In this paper, we 1) highlight that both under- and overestimation bias can improve learning efficiency, and that balancing them is a particular form of the exploration-exploitation dilemma; 2) propose a unified framework called Realistic Actor-Critic (RAC), which employs Universal Value Function Approximators (UVFA) to simultaneously learn, within a single neural network, policies associated with different value confidence bounds, each realizing a different under-/overestimation trade-off. This allows us to perform directed exploration without over-exploration using the upper bounds, while still avoiding overestimation using the lower bounds; 3) propose a variant of the soft Bellman backup, called punished Bellman backup, which provides fine-grained estimation bias control to train policies efficiently. Through carefully designed experiments, we empirically verify that RAC achieves 10x sample efficiency and a 25\% performance improvement compared to Soft Actor-Critic on the most challenging Humanoid environment. All source code is available at \url{https://github.com/ihuhuhu/RAC}.
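Since the abstract only describes the punished Bellman backup and the UVFA-conditioned policy family at a high level, the following is a minimal Python/NumPy sketch of one plausible reading: an ensemble-based target penalized by a confidence parameter. The function name `punished_bellman_target`, the mean-minus-`beta`-times-std form, and all arguments are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def punished_bellman_target(q_ensemble, reward, done, beta, gamma=0.99):
    """Hypothetical sketch of a confidence-controlled Bellman target.

    q_ensemble: array of shape (num_critics, batch) with next-state Q-values.
    beta:       confidence parameter; small beta yields an optimistic (upper-bound)
                target, large beta yields a pessimistic (lower-bound) target.
    """
    q_mean = np.mean(q_ensemble, axis=0)   # ensemble mean estimate
    q_std = np.std(q_ensemble, axis=0)     # ensemble disagreement (uncertainty proxy)
    # Penalize the mean by beta * std to interpolate between upper and lower bounds.
    q_penalized = q_mean - beta * q_std
    return reward + gamma * (1.0 - done) * q_penalized

# Example usage with a toy ensemble of 5 critics over a batch of 3 transitions.
q_next = np.random.randn(5, 3)
target = punished_bellman_target(q_next, reward=np.ones(3), done=np.zeros(3), beta=0.5)
```

In a RAC-style setup, the confidence parameter would additionally be fed as an extra input to the actor and critics (UVFA style), so that a single network represents the whole family of policies ranging from optimistic (small `beta`, directed exploration) to conservative (large `beta`, stable updates).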