In this paper, hypernetworks are trained to generate behaviors across a range of unseen task conditions, via a novel TD-based training objective and data from a set of near-optimal RL solutions for training tasks. This work relates to meta RL, contextual RL, and transfer learning, with a particular focus on zero-shot performance at test time, enabled by knowledge of the task parameters (also known as context). Our technical approach views each RL algorithm as a mapping from an MDP's specification to its near-optimal value function and policy, and seeks to approximate this mapping with a hypernetwork that, given the parameters of an MDP, generates near-optimal value functions and policies. We show that, under certain conditions, learning this mapping can be cast as a supervised learning problem. We empirically evaluate the effectiveness of our method for zero-shot transfer to new reward and transition dynamics on a series of continuous control tasks from the DeepMind Control Suite. Our method demonstrates significant improvements over multi-task and meta-RL baselines.
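To make the mapping concrete, the sketch below shows a hypernetwork that takes a task's context vector and emits the weights of a small policy MLP, which is then evaluated on an observation. This is a minimal illustration under stated assumptions, not the paper's architecture: the `HyperPolicy` class, layer sizes, and context encoding are hypothetical, and the TD-based training objective and supervision from near-optimal training-task solutions are omitted.

```python
import torch
import torch.nn as nn


class HyperPolicy(nn.Module):
    """Hypothetical sketch: a hypernetwork mapping task parameters (context)
    to the flattened weights of a small one-hidden-layer MLP policy."""

    def __init__(self, context_dim, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.obs_dim, self.act_dim, self.hidden = obs_dim, act_dim, hidden
        # Total number of policy parameters: two linear layers (W1, b1, W2, b2).
        n_params = hidden * obs_dim + hidden + act_dim * hidden + act_dim
        # The hypernetwork itself is an ordinary MLP over the context vector.
        self.net = nn.Sequential(
            nn.Linear(context_dim, 256), nn.ReLU(),
            nn.Linear(256, 256), nn.ReLU(),
            nn.Linear(256, n_params),
        )

    def forward(self, context, obs):
        """Generate policy weights from the context, then run the policy on obs."""
        p = self.net(context)
        h, o, a = self.hidden, self.obs_dim, self.act_dim
        # Slice the flat parameter vector into the policy's weights and biases.
        W1, p = p[: h * o].view(h, o), p[h * o:]
        b1, p = p[:h], p[h:]
        W2, p = p[: a * h].view(a, h), p[a * h:]
        b2 = p[:a]
        x = torch.tanh(obs @ W1.T + b1)
        return torch.tanh(x @ W2.T + b2)  # actions in [-1, 1]


# Usage: zero-shot generation of a policy for an unseen task context.
hyper = HyperPolicy(context_dim=3, obs_dim=8, act_dim=2)
context = torch.tensor([0.5, 1.2, -0.3])  # e.g. unseen reward/dynamics parameters
obs = torch.randn(8)
action = hyper(context, obs)
```

In this sketch the generated policy is used directly at test time without any gradient updates, which is the sense in which the transfer is zero-shot; training such a hypernetwork would additionally require the supervised and TD-based losses described in the paper.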