The goal of Unsupervised Reinforcement Learning (URL) is to find a reward-agnostic prior policy on a task domain, such that the sample efficiency on supervised downstream tasks is improved. Although agents initialized with such a prior policy can achieve a significantly higher reward with fewer samples when finetuned on the downstream task, it is still an open question how an optimal pretrained prior policy can be obtained in practice. In this work, we present POLTER (Policy Trajectory Ensemble Regularization), a general method to regularize pretraining that can be applied to any URL algorithm and is especially useful for data- and knowledge-based URL algorithms. It utilizes an ensemble of policies discovered during pretraining and moves the policy of the URL algorithm closer to its optimal prior. Our method is based on a theoretical framework, and we analyze its practical effects on a white-box benchmark, allowing us to study POLTER with full control. In our main experiments, we evaluate POLTER on the Unsupervised Reinforcement Learning Benchmark (URLB), which consists of 12 tasks in 3 domains. We demonstrate the generality of our approach by improving the performance of a diverse set of data- and knowledge-based URL algorithms by 19% on average and up to 40% in the best case. Under a fair comparison with tuned baselines and tuned POLTER, we establish a new state-of-the-art on the URLB.
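The following is a minimal sketch, not the authors' implementation, of the regularization idea described above: periodically snapshot the policy during URL pretraining, form an ensemble prior from the snapshots, and add a divergence penalty that pulls the current policy toward that prior. The names `url_loss`, `kl_weight`, and `snapshot_every`, as well as the assumption of a discrete-action (Categorical) policy, are illustrative choices and not taken from the paper.

```python
# Conceptual sketch of ensemble-prior regularization during URL pretraining.
# Assumes `policy(obs)` returns a torch.distributions.Categorical; this is an
# illustrative simplification, not the POLTER reference implementation.
import copy
import torch


def polter_regularized_loss(policy, snapshots, obs, url_loss, kl_weight=1.0):
    """Add a KL term pulling the current policy toward the snapshot-ensemble prior."""
    if not snapshots:
        return url_loss
    dist = policy(obs)  # current action distribution
    with torch.no_grad():
        # Ensemble prior: average the action probabilities of the stored snapshots.
        prior_probs = torch.stack([snap(obs).probs for snap in snapshots]).mean(0)
    prior = torch.distributions.Categorical(probs=prior_probs)
    kl = torch.distributions.kl_divergence(dist, prior).mean()
    return url_loss + kl_weight * kl


def maybe_snapshot(policy, snapshots, step, snapshot_every=10_000):
    """Store a frozen copy of the policy at a fixed pretraining interval."""
    if step % snapshot_every == 0:
        snapshots.append(copy.deepcopy(policy).eval())
```

In this sketch, the snapshot interval and the weight of the KL penalty control how strongly pretraining is pulled toward the ensemble prior; both would be hyperparameters of the chosen URL algorithm.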