Model-free deep reinforcement learning (RL) has been successfully applied to challenging continuous control domains. However, poor sample efficiency prevents these methods from being widely used in real-world settings. This paper introduces a novel model-free algorithm, Realistic Actor-Critic (RAC), which can be combined with any off-policy RL algorithm to improve sample efficiency. RAC employs Universal Value Function Approximators (UVFA) to simultaneously learn a family of policies with the same neural network, each with a different trade-off between underestimation and overestimation. To learn such policies, we introduce uncertainty punished Q-learning, which uses the uncertainty from an ensemble of critics to build various confidence bounds of the Q-function. We evaluate RAC on the MuJoCo benchmark, achieving 10x sample efficiency and a 25% performance improvement over SAC on the most challenging Humanoid environment.
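To make the idea concrete, below is a minimal sketch of how an uncertainty-punished target could be formed from an ensemble of critics, assuming the confidence bound is the ensemble mean shifted by a scaled ensemble standard deviation; the function name and the hyperparameter `beta` are illustrative, not necessarily the paper's exact formulation.

```python
# Sketch of an uncertainty-punished Q-target from an ensemble of critics.
# Assumption: the confidence bound is mean(Q) - beta * std(Q), where beta
# controls the trade-off between underestimation and overestimation.
import numpy as np

def uncertainty_punished_target(q_values: np.ndarray, beta: float) -> np.ndarray:
    """Combine an ensemble of critic estimates into a single target.

    q_values: shape (num_critics, batch_size), bootstrapped Q estimates.
    beta:     confidence parameter; larger beta -> more pessimistic target
              (underestimation), beta near 0 -> close to the ensemble mean.
    """
    mean = q_values.mean(axis=0)
    std = q_values.std(axis=0)
    return mean - beta * std

# A UVFA-style policy family would condition actor and critics on beta,
# so a single network covers the whole spectrum of confidence bounds.
ensemble = np.array([[10.0, 5.0], [12.0, 4.0], [11.0, 6.0]])  # 3 critics, 2 states
for beta in (0.0, 0.5, 1.0):
    print(beta, uncertainty_punished_target(ensemble, beta))
```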