Model-free deep reinforcement learning (RL) has been successfully applied to challenging continuous control domains. However, poor sample efficiency prevents these methods from being widely used in real-world domains. This paper introduces a novel model-free algorithm, Realistic Actor-Critic (RAC), which can be incorporated into any off-policy RL algorithm to improve sample efficiency. RAC employs Universal Value Function Approximators (UVFA) to simultaneously learn a family of policies with the same neural network, each with a different trade-off between underestimation and overestimation. To learn such policies, we introduce uncertainty-punished Q-learning, which uses the uncertainty from an ensemble of critics to build various confidence bounds of the Q-function. We evaluate RAC on the MuJoCo benchmark, achieving 10x better sample efficiency and a 25\% performance improvement over SAC on the most challenging Humanoid environment.
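To make the core idea concrete, the following is a minimal sketch of an uncertainty-punished Q-target, assuming the confidence bound takes the common mean-minus-scaled-standard-deviation form over the critic ensemble; the function and variable names (uncertainty_punished_target, q_ensemble, beta) are hypothetical, not the paper's exact formulation.

\begin{verbatim}
import torch

def uncertainty_punished_target(q_ensemble: torch.Tensor,
                                beta: torch.Tensor) -> torch.Tensor:
    """Build a confidence bound of the Q-function from an ensemble of critics.

    q_ensemble: (num_critics, batch) Q-value estimates from the critic ensemble.
    beta:       (batch,) confidence parameter; larger values punish uncertainty
                more and yield a more pessimistic (lower) bound.
    """
    q_mean = q_ensemble.mean(dim=0)   # ensemble mean estimate
    q_std = q_ensemble.std(dim=0)     # ensemble disagreement as uncertainty
    return q_mean - beta * q_std      # lower confidence bound when beta > 0

# Hypothetical usage: three critics, a batch of four state-action pairs, and a
# beta sampled per sample so one shared network can represent the whole
# policy family, from optimistic (small beta) to pessimistic (large beta).
q_ensemble = torch.randn(3, 4)
beta = torch.rand(4)
target = uncertainty_punished_target(q_ensemble, beta)
\end{verbatim}

Sampling beta per sample, rather than fixing it, is what lets a single UVFA-conditioned network represent the whole family of confidence bounds at once.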