While unsupervised skill discovery has shown promise in autonomously acquiring behavioral primitives, there is still a large methodological disconnect between task-agnostic skill pretraining and downstream, task-aware finetuning. We present Intrinsic Reward Matching (IRM), which unifies these two phases of learning via the $\textit{skill discriminator}$, a pretraining model component often discarded during finetuning. Conventional approaches finetune pretrained agents directly at the policy level, often relying on expensive environment rollouts to empirically determine the optimal skill. However, often the most concise yet complete description of a task is the reward function itself, and skill learning methods learn an $\textit{intrinsic}$ reward function via the discriminator that corresponds to the skill policy. We propose to leverage the skill discriminator to $\textit{match}$ the intrinsic and downstream task rewards and determine the optimal skill for an unseen task without environment samples, consequently finetuning with greater sample efficiency. Furthermore, we generalize IRM to sequence skills and solve more complex, long-horizon tasks. We demonstrate that IRM enables us to utilize pretrained skills far more effectively than previous skill selection methods on the Unsupervised Reinforcement Learning Benchmark and on challenging tabletop manipulation tasks.
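To make the core idea concrete, the sketch below illustrates one way skill selection by reward matching could look: score each candidate skill by how closely its discriminator-derived intrinsic reward agrees with the downstream task reward on a buffer of pretraining transitions, with no new environment rollouts. This is a minimal illustration under stated assumptions, not the paper's implementation: the random linear discriminator, the use of $\log q(z \mid s')$ as the intrinsic reward, the normalized squared-error matching loss, and all names (`intrinsic_reward`, `task_reward`, `matching_loss`) are hypothetical stand-ins for exposition.

```python
# Illustrative sketch of intrinsic reward matching for skill selection (assumptions only;
# the actual IRM method may use a different discriminator and reward-distance measure).
import numpy as np

rng = np.random.default_rng(0)
STATE_DIM, SKILL_DIM, N_CANDIDATES, N_TRANSITIONS = 8, 4, 64, 512

# Stand-in for the pretrained skill discriminator q(z | s'): here a random linear model.
W = rng.normal(size=(STATE_DIM, SKILL_DIM))

def intrinsic_reward(next_states: np.ndarray, z: np.ndarray) -> np.ndarray:
    """log q(z | s') up to a constant: higher when s' is well explained by skill z."""
    logits = next_states @ W                                        # (N, SKILL_DIM)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return log_probs @ z                                            # project onto candidate skill

def task_reward(next_states: np.ndarray) -> np.ndarray:
    """Stand-in for the downstream task reward r(s'), e.g. negative distance to a goal."""
    goal = np.ones(STATE_DIM)
    return -np.linalg.norm(next_states - goal, axis=1)

# Buffer of transitions collected during pretraining -- no new environment samples needed.
next_states = rng.normal(size=(N_TRANSITIONS, STATE_DIM))

def matching_loss(z: np.ndarray) -> float:
    """Squared error between normalized intrinsic and task rewards on the buffer."""
    r_int, r_task = intrinsic_reward(next_states, z), task_reward(next_states)
    r_int = (r_int - r_int.mean()) / (r_int.std() + 1e-8)           # scale/shift invariance
    r_task = (r_task - r_task.mean()) / (r_task.std() + 1e-8)
    return float(np.mean((r_int - r_task) ** 2))

# Select the skill whose intrinsic reward best matches the task reward, then finetune it.
candidates = rng.normal(size=(N_CANDIDATES, SKILL_DIM))
candidates /= np.linalg.norm(candidates, axis=1, keepdims=True)
best_z = min(candidates, key=matching_loss)
print("selected skill vector:", np.round(best_z, 3))
```

Matching at the reward level rather than the policy level is what removes the need for evaluation rollouts: the comparison uses only the discriminator and data already available from pretraining, and the selected skill's policy then serves as the initialization for finetuning.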