Despite the recent success of deep network-based Reinforcement Learning (RL), achieving human-level efficiency in learning novel tasks remains elusive. While previous efforts attempt to address this challenge with meta-learning strategies, they typically suffer from sample inefficiency with on-policy RL algorithms or from meta-overfitting with off-policy learning. In this work, we propose a novel meta-RL strategy to address those limitations. In particular, we decompose the meta-RL problem into three sub-tasks: task exploration, task inference, and task fulfillment, instantiated with two deep network agents and a task encoder. During meta-training, our method learns a task-conditioned actor network for task fulfillment, an explorer network with self-supervised reward shaping that encourages task-informative experiences during task exploration, and a context-aware graph-based task encoder for task inference. We validate our approach with extensive experiments on several public benchmarks, and the results show that our algorithm effectively performs exploration for task inference, improves sample efficiency during both training and testing, and mitigates the meta-overfitting problem.
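To make the three-way decomposition concrete, the sketch below illustrates one way the components could fit together: a context encoder that aggregates collected transitions into a task embedding, an explorer policy whose bonus rewards task-informative transitions, and a task-conditioned actor. This is a minimal, hypothetical illustration, not the authors' implementation; all class names, dimensions, and the specific exploration bonus are assumptions.

```python
# Hypothetical sketch of the task-encoder / explorer / actor decomposition.
# Module names and dimensions (OBS_DIM, ACT_DIM, Z_DIM) are illustrative only.
import torch
import torch.nn as nn

OBS_DIM, ACT_DIM, Z_DIM = 8, 2, 5

class TaskEncoder(nn.Module):
    """Aggregates context transitions (s, a, r, s') into a task embedding z."""
    def __init__(self):
        super().__init__()
        self.phi = nn.Sequential(nn.Linear(2 * OBS_DIM + ACT_DIM + 1, 64),
                                 nn.ReLU(), nn.Linear(64, Z_DIM))

    def forward(self, context):           # context: (N, 2*OBS_DIM + ACT_DIM + 1)
        return self.phi(context).mean(0)  # permutation-invariant aggregation

class PolicyNet(nn.Module):
    """Simple MLP policy head; the actor conditions on z, the explorer does not."""
    def __init__(self, in_dim):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(),
                                 nn.Linear(64, ACT_DIM), nn.Tanh())

    def forward(self, x):
        return self.net(x)

encoder = TaskEncoder()
explorer = PolicyNet(OBS_DIM)            # gathers task-informative transitions
actor = PolicyNet(OBS_DIM + Z_DIM)       # fulfills the task given the inferred z

# Toy rollout: the explorer's context is encoded into z, and the actor acts on it.
context = torch.randn(10, 2 * OBS_DIM + ACT_DIM + 1)   # placeholder transitions
z = encoder(context)
obs = torch.randn(OBS_DIM)
action = actor(torch.cat([obs, z]))

# One possible self-supervised exploration bonus: reward the explorer when a new
# transition shifts the inferred embedding, i.e., when it is task-informative.
new_transition = torch.randn(1, 2 * OBS_DIM + ACT_DIM + 1)
z_new = encoder(torch.cat([context, new_transition], dim=0))
explore_bonus = (z_new - z).norm().item()
```

In this sketch the encoder is a simple mean-aggregation set encoder rather than the context-aware graph-based encoder described above, and the bonus is a generic embedding-change heuristic standing in for the paper's self-supervised reward shaping.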