The ability to exploit prior experience to solve novel problems rapidly is a hallmark of biological learning systems and of great practical importance for artificial ones. In the meta reinforcement learning literature, much recent work has focused on the problem of optimizing the learning process itself. In this paper we study a complementary approach which is conceptually simple, general, modular, and built on top of recent improvements in off-policy learning. The framework is inspired by ideas from the probabilistic inference literature and combines robust off-policy learning with a behavior prior, or default behavior, that constrains the space of solutions and serves as a bias for exploration, as well as a representation for the value function; both are easily learned from a number of training tasks in a multi-task scenario. Our approach achieves competitive adaptation performance on hold-out tasks compared to meta reinforcement learning baselines and can scale to complex sparse-reward scenarios.
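As a rough sketch of how a learned behavior prior can constrain the space of solutions, work in this probabilistic-inference-inspired line typically optimizes a KL-regularized return; the temperature $\alpha$, prior $\pi_0$, and discount $\gamma$ below are notational assumptions for illustration, not quantities fixed by the abstract itself:

$$\mathcal{J}(\pi) \;=\; \mathbb{E}_{\pi}\!\left[\sum_{t=0}^{\infty} \gamma^{t}\Big(r(s_t, a_t) \;-\; \alpha\,\mathrm{KL}\big(\pi(\cdot \mid s_t)\,\big\|\,\pi_0(\cdot \mid s_t)\big)\Big)\right]$$

Under this reading, the prior $\pi_0$ penalizes policies that stray far from the default behavior learned across training tasks, which is one way such a prior can act as a bias for exploration on held-out tasks.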