Model-free deep reinforcement learning (RL) has been successfully applied to challenging continuous control domains. However, poor sample efficiency prevents these methods from being widely used in real-world domains. We address this problem by proposing a novel model-free algorithm, Realistic Actor-Critic (RAC), which aims to balance the trade-off between value underestimation and overestimation by learning a policy family spanning various confidence bounds of the Q-function. We construct uncertainty-punished Q-learning (UPQ), which uses the uncertainty from an ensemble of multiple critics to control the estimation bias of the Q-function, making Q-functions shift smoothly from lower- to higher-confidence bounds. Guided by these critics, RAC employs Universal Value Function Approximators (UVFA) to simultaneously learn many optimistic and pessimistic policies within the same neural network. Optimistic policies generate effective exploratory behaviors, while pessimistic policies reduce the risk of value overestimation to ensure stable updates of policies and Q-functions. The proposed method can be incorporated into any off-policy actor-critic RL algorithm. Our method achieves 10x sample efficiency and a 25\% performance improvement compared to SAC on the most challenging Humanoid environment, obtaining an episode reward of $11107\pm 475$ at $10^6$ time steps. All the source code is available at https://github.com/ihuhuhu/RAC.
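To make the UPQ idea concrete, below is a minimal sketch of an uncertainty-punished Q-target, assuming a PyTorch critic ensemble. The helper name `upq_target` and the penalty weight `beta` are illustrative assumptions, not the paper's released implementation; in RAC, the UVFA-style policy and critics would additionally be conditioned on the confidence parameter.

```python
# Minimal sketch (assumed PyTorch API): penalize the ensemble-mean Q-value by its
# ensemble standard deviation, interpolating between optimistic (small beta) and
# pessimistic (large beta) confidence bounds.
import torch

def upq_target(critic_ensemble, next_obs, next_action, reward, not_done,
               gamma: float = 0.99, beta: float = 1.0):
    """Bootstrapped target: ensemble mean minus beta * ensemble std."""
    with torch.no_grad():
        # Stack Q-estimates from every critic: shape (n_critics, batch)
        qs = torch.stack([q(next_obs, next_action) for q in critic_ensemble], dim=0)
        mean_q = qs.mean(dim=0)
        std_q = qs.std(dim=0)                # epistemic-uncertainty proxy
        penalized_q = mean_q - beta * std_q  # confidence-bound value estimate
        return reward + not_done * gamma * penalized_q
```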