How can a reinforcement learning (RL) agent prepare to solve downstream tasks if those tasks are not known a priori? One approach is unsupervised skill discovery, a class of algorithms that learn a set of policies without access to a reward function. Such algorithms bear a close resemblance to representation learning algorithms (e.g., contrastive learning) in supervised learning, in that both are pretraining algorithms that maximize some approximation to a mutual information objective. While prior work has shown that the set of skills learned by such methods can accelerate downstream RL tasks, it offers little analysis of whether these skill learning algorithms are optimal, or even what notion of optimality would be appropriate to apply to them. In this work, we show that unsupervised skill discovery algorithms based on mutual information maximization do not learn skills that are optimal for every possible reward function. However, we show that the distribution over skills provides an optimal initialization, minimizing regret against adversarially chosen reward functions under a certain type of adaptation procedure. Our analysis also provides a geometric perspective on these skill learning methods.
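As a concrete point of reference for the class of objectives referred to above, a common instantiation from the skill-discovery literature (e.g., DIAYN-style methods, not a formulation specific to this work) maximizes a variational lower bound on the mutual information between states \(S\) and a latent skill variable \(Z\), using a learned discriminator \(q_\phi(z \mid s)\):
\[
I(S; Z) \;=\; \mathcal{H}[Z] - \mathcal{H}[Z \mid S] \;\ge\; \mathbb{E}_{z \sim p(z),\, s \sim \pi_z}\big[\log q_\phi(z \mid s) - \log p(z)\big],
\]
so each skill policy \(\pi_z\) is rewarded for visiting states from which its latent code \(z\) can be recovered.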