Meta reinforcement learning (Meta-RL) is an approach wherein the experience gained from solving a variety of tasks is distilled into a meta-policy. When adapted over only a small number of steps (or even a single step), the meta-policy is able to perform near-optimally on a new, related task. However, a major challenge in applying this approach to real-world problems is that such problems often have sparse reward functions that only indicate whether a task is completed partially or fully. We consider the situation where some data, possibly generated by a sub-optimal agent, is available for each task. We then develop a class of algorithms called Enhanced Meta-RL using Demonstrations (EMRLD) that exploit this information, even when it is sub-optimal, to obtain guidance during training. We show how EMRLD jointly utilizes RL and supervised learning over the offline data to generate a meta-policy that exhibits monotone performance improvement. We also develop a warm-started variant called EMRLD-WS that is particularly efficient for sub-optimal demonstration data. Finally, we show that our EMRLD algorithms significantly outperform existing approaches in a variety of sparse reward environments, including that of a mobile robot.
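As a rough illustration (not the authors' exact formulation), the joint use of RL and supervised learning over demonstration data described above can be sketched as a policy objective augmented with a behavior-cloning term, where the notation below is an assumption for exposition only:
\[
\max_{\theta}\; \mathbb{E}_{\tau \sim \pi_\theta}\!\left[\sum_{t} r(s_t, a_t)\right] \;+\; \lambda \sum_{(s,a) \in \mathcal{D}_i} \log \pi_\theta(a \mid s),
\]
where $\pi_\theta$ is the policy being meta-trained, $\mathcal{D}_i$ is the (possibly sub-optimal) demonstration data for task $i$, and $\lambda$ weights the supervised term; these symbols are illustrative placeholders rather than notation taken from the paper.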