Deep Reinforcement Learning (RL) has emerged as a powerful paradigm to solve a range of complex yet specific control tasks. Yet training generalist agents that can quickly adapt to new tasks remains an outstanding challenge. Recent advances in unsupervised RL have shown that pre-training RL agents with self-supervised intrinsic rewards can result in efficient adaptation. However, these algorithms have been hard to compare and develop due to the lack of a unified benchmark. To this end, we introduce the Unsupervised Reinforcement Learning Benchmark (URLB). URLB consists of two phases: reward-free pre-training and downstream task adaptation with extrinsic rewards. Building on the DeepMind Control Suite, we provide twelve continuous control tasks from three domains for evaluation and open-source code for eight leading unsupervised RL methods. We find that the implemented baselines make progress but are not able to solve URLB and propose directions for future research.
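To make the two-phase evaluation protocol concrete, the sketch below illustrates reward-free pre-training with an intrinsic bonus followed by downstream adaptation with extrinsic rewards. It is a toy illustration only: `ToyEnv`, `Agent`, and `intrinsic_reward` are hypothetical stand-ins and are not part of the URLB codebase, which builds on the DeepMind Control Suite.

```python
# Minimal sketch of the URLB two-phase protocol (hypothetical stand-ins,
# not the benchmark's actual API): reward-free pre-training with an
# intrinsic reward, then fine-tuning on a downstream task's extrinsic reward.
import numpy as np


class ToyEnv:
    """Stub continuous-control environment (stand-in for a DMC task)."""

    def __init__(self, seed=0):
        self.rng = np.random.default_rng(seed)
        self.state = np.zeros(4)

    def reset(self):
        self.state = self.rng.normal(size=4)
        return self.state

    def step(self, action):
        self.state = self.state + 0.1 * action + 0.01 * self.rng.normal(size=4)
        extrinsic_reward = -float(np.sum(self.state ** 2))  # downstream task reward
        return self.state, extrinsic_reward


class Agent:
    """Hypothetical agent with a trivial linear policy."""

    def __init__(self):
        self.weights = np.zeros((4, 4))

    def act(self, obs):
        return np.tanh(self.weights @ obs)

    def update(self, obs, action, reward):
        # Placeholder update; a real agent would run an RL algorithm here.
        self.weights += 1e-3 * reward * np.outer(action, obs)


def intrinsic_reward(obs, visited):
    """Toy novelty bonus: distance to the nearest previously visited state."""
    if not visited:
        return 1.0
    return float(min(np.linalg.norm(obs - v) for v in visited))


def run_urlb_protocol(pretrain_steps=1000, finetune_steps=100):
    env, agent, visited = ToyEnv(), Agent(), []

    # Phase 1: reward-free pre-training; the extrinsic reward is discarded.
    obs = env.reset()
    for _ in range(pretrain_steps):
        action = agent.act(obs)
        next_obs, _ = env.step(action)
        agent.update(obs, action, intrinsic_reward(next_obs, visited))
        visited.append(next_obs)
        obs = next_obs

    # Phase 2: adaptation to the downstream task using extrinsic rewards.
    obs = env.reset()
    returns = 0.0
    for _ in range(finetune_steps):
        action = agent.act(obs)
        obs, reward = env.step(action)
        agent.update(obs, action, reward)
        returns += reward
    return returns


if __name__ == "__main__":
    print("fine-tuning return:", run_urlb_protocol())
```

In the benchmark itself, the eight baselines differ only in how the intrinsic reward (Phase 1) is computed, while Phase 2 measures how quickly each pre-trained agent adapts to the twelve downstream tasks.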