Several recent works have been dedicated to unsupervised reinforcement learning in a single environment, in which a policy is first pre-trained with unsupervised interactions, and then fine-tuned towards the optimal policy for several downstream supervised tasks defined over the same environment. Along this line, we address the problem of unsupervised reinforcement learning in a class of multiple environments, in which the policy is pre-trained with interactions from the whole class, and then fine-tuned for several tasks in any environment of the class. Notably, the problem is inherently multi-objective, as we can trade off the pre-training objective between environments in many ways. In this work, we foster an exploration strategy that is sensitive to the most adverse cases within the class. Hence, we cast the exploration problem as the maximization of the mean of the state visitation entropy induced by the exploration strategy over a critical percentile of the worst-case environments in the class. Then, we present a policy gradient algorithm, $\alpha$MEPOL, to optimize the introduced objective through mediated interactions with the class. Finally, we empirically demonstrate the ability of the algorithm to learn to explore challenging classes of continuous environments, and we show that reinforcement learning greatly benefits from the pre-trained exploration strategy w.r.t. learning from scratch.
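As a minimal sketch of the percentile-sensitive objective described above, using notation that is assumed here rather than taken from the abstract: let $p_{\mathcal{M}}$ be a distribution over the environments $M$ of the class, let $d^{\pi}_{M}$ be the state visitation distribution induced by an exploration policy $\pi$ in environment $M$ with entropy $H(d^{\pi}_{M})$, and let $\alpha \in (0, 1]$ be the chosen percentile level. The criterion can then be written as a conditional expectation over the worst-performing tail of the class:
$$\max_{\pi}\ \mathbb{E}_{M \sim p_{\mathcal{M}}}\!\left[\, H\big(d^{\pi}_{M}\big) \,\middle|\, H\big(d^{\pi}_{M}\big) \le y_{\alpha} \,\right], \qquad y_{\alpha} = \inf\Big\{ y \in \mathbb{R} : \Pr_{M \sim p_{\mathcal{M}}}\big( H(d^{\pi}_{M}) \le y \big) \ge \alpha \Big\},$$
where $y_{\alpha}$ denotes the $\alpha$-quantile of the entropy across the class. With $\alpha = 1$ the criterion reduces to the plain mean over all environments, while smaller values of $\alpha$ concentrate the pre-training objective on the most adverse cases within the class.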