While unsupervised skill discovery has shown promise in autonomously acquiring behavioral primitives, there remains a large methodological disconnect between task-agnostic skill pretraining and downstream, task-aware finetuning. We present Intrinsic Reward Matching (IRM), which unifies these two phases of learning via the $\textit{skill discriminator}$, a pretraining model component often discarded during finetuning. Conventional approaches finetune pretrained agents directly at the policy level, often relying on expensive environment rollouts to empirically determine the optimal skill. However, the most concise yet complete description of a task is often the reward function itself, and skill learning methods learn an $\textit{intrinsic}$ reward function via the discriminator that corresponds to the skill policy. We propose to leverage the skill discriminator to $\textit{match}$ the intrinsic and downstream task rewards and determine the optimal skill for an unseen task without environment samples, consequently finetuning with greater sample efficiency. Furthermore, we generalize IRM to sequence skills and solve more complex, long-horizon tasks. We demonstrate that IRM is competitive with previous skill selection methods on the Unsupervised Reinforcement Learning Benchmark and enables us to utilize pretrained skills far more effectively on challenging tabletop manipulation tasks.
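The core selection step described above, picking the skill whose discriminator-derived intrinsic reward best matches the downstream task reward without any environment rollouts, can be sketched as follows. This is a minimal illustration with hypothetical names: the linear `intrinsic_reward` is a stand-in for a learned discriminator, and a simple mean-squared discrepancy is used as the matching loss (the paper's exact matching objective may differ).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: 2D states and a few discrete skills.
# Stand-in for a learned skill discriminator: each skill's intrinsic
# reward is the state's alignment with a skill-specific direction.
skill_dirs = {z: rng.normal(size=2) for z in range(4)}

def intrinsic_reward(states, z):
    # Reward of skill z on a batch of states (discriminator surrogate).
    return states @ skill_dirs[z]

def task_reward(states):
    # Hypothetical downstream task: move along the first state dimension.
    return states[:, 0]

def match_skill(states, skills):
    """Select the skill whose intrinsic reward best matches the task
    reward over sampled states -- no environment rollouts required."""
    losses = {
        z: np.mean((intrinsic_reward(states, z) - task_reward(states)) ** 2)
        for z in skills
    }
    return min(losses, key=losses.get), losses

# Match over a batch of sampled states rather than fresh rollouts.
states = rng.normal(size=(256, 2))
best_skill, losses = match_skill(states, skill_dirs)
```

The selected `best_skill` would then initialize the policy for finetuning, replacing the expensive rollout-based skill search that conventional pipelines use.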