Meta reinforcement learning (meta-RL) extracts knowledge from previous tasks and enables fast adaptation to new tasks. Despite recent progress, efficient exploration in meta-RL remains a key challenge in sparse-reward tasks, as it requires quickly finding informative, task-relevant experiences during both meta-training and adaptation. To address this challenge, we explicitly model an exploration policy learning problem for meta-RL, decoupled from exploitation policy learning, and introduce a novel empowerment-driven exploration objective that maximizes information gain for task identification. We derive a corresponding intrinsic reward and develop a new off-policy meta-RL framework that efficiently learns separate context-aware exploration and exploitation policies by sharing task-inference knowledge between them. Experimental evaluation shows that our meta-RL method significantly outperforms state-of-the-art baselines on various sparse-reward MuJoCo locomotion tasks and more complex sparse-reward Meta-World tasks.
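The abstract does not specify the form of the intrinsic reward, so the following is only a minimal illustrative sketch of one common way to instantiate an information-gain bonus for task identification: reward a transition by the KL divergence between the task-belief posterior after and before observing it. The `TaskEncoder` architecture and all names here are hypothetical, not the paper's implementation.

```python
# Sketch (assumed, not the paper's method): information-gain intrinsic reward
# computed as KL( q(z | context + new transition) || q(z | context) ),
# where q(z | .) is a diagonal-Gaussian task belief from a context encoder.
import torch
import torch.nn as nn
from torch.distributions import Normal, kl_divergence


class TaskEncoder(nn.Module):
    """Hypothetical context encoder: maps transitions (s, a, r, s') to a
    diagonal Gaussian belief over the latent task variable z."""

    def __init__(self, transition_dim: int, latent_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(transition_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 2 * latent_dim),
        )
        self.latent_dim = latent_dim

    def forward(self, context: torch.Tensor) -> Normal:
        # context: (num_transitions, transition_dim); aggregate by mean pooling.
        h = self.net(context).mean(dim=0)
        mu, log_std = h[: self.latent_dim], h[self.latent_dim:]
        return Normal(mu, log_std.exp())


def information_gain_reward(encoder: TaskEncoder,
                            context: torch.Tensor,
                            new_transition: torch.Tensor) -> torch.Tensor:
    """Intrinsic reward for one transition: KL between the updated and the
    previous task belief (larger when the transition is more informative)."""
    with torch.no_grad():
        prior = encoder(context)
        posterior = encoder(torch.cat([context, new_transition[None]], dim=0))
        return kl_divergence(posterior, prior).sum()


if __name__ == "__main__":
    enc = TaskEncoder(transition_dim=8, latent_dim=4)
    ctx = torch.randn(16, 8)   # transitions collected so far for this task
    new_t = torch.randn(8)     # candidate new transition
    print(information_gain_reward(enc, ctx, new_t).item())
```

In a setup like this, the bonus could be added to the exploration policy's reward while the exploitation policy is trained on the (sparse) extrinsic reward, with both conditioning on the shared task belief.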