Recent state-of-the-art artificial agents lack the ability to adapt rapidly to new tasks, as they are trained exclusively for specific objectives and require massive amounts of interaction to learn new skills. Meta-reinforcement learning (meta-RL) addresses this challenge by leveraging knowledge learned from training tasks to perform well in previously unseen tasks. However, current meta-RL approaches limit themselves to narrow parametric task distributions, ignoring qualitative differences between tasks that occur in the real world. In this paper, we introduce TIGR, a Task-Inference-based meta-RL algorithm using Gaussian mixture models (GMM) and gated Recurrent units, designed for tasks in non-parametric environments. We employ a generative model involving a GMM to capture the multi-modality of the tasks. We decouple the policy training from the task-inference learning and efficiently train the inference mechanism with an unsupervised reconstruction objective. We provide a benchmark of qualitatively distinct tasks based on the half-cheetah environment and demonstrate the superior performance of TIGR compared to state-of-the-art meta-RL approaches in terms of sample efficiency (3-10 times faster), asymptotic performance, and applicability in non-parametric environments with zero-shot adaptation.
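To make the described setup concrete, the following is a minimal, illustrative sketch of a task-inference mechanism of the kind outlined above: a GRU encodes a window of transitions into a mixture-of-Gaussians posterior over a latent task variable, and a reward decoder provides the unsupervised reconstruction signal, trained separately from the policy. All module names, dimensions, and loss details here are assumptions for illustration and do not reproduce the authors' implementation.

# Hypothetical sketch of GRU + GMM task inference with a reconstruction loss.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GmmTaskEncoder(nn.Module):
    """Encode a trajectory of transitions into a mixture-of-Gaussians
    posterior over a latent task variable z (shapes are assumptions)."""

    def __init__(self, transition_dim, hidden_dim=64, latent_dim=8, num_components=4):
        super().__init__()
        self.gru = nn.GRU(transition_dim, hidden_dim, batch_first=True)
        self.num_components = num_components
        self.latent_dim = latent_dim
        # One (mean, log-std) pair per mixture component, plus mixing logits.
        self.mean_head = nn.Linear(hidden_dim, num_components * latent_dim)
        self.log_std_head = nn.Linear(hidden_dim, num_components * latent_dim)
        self.logit_head = nn.Linear(hidden_dim, num_components)

    def forward(self, transitions):
        # transitions: (batch, time, transition_dim), e.g. concatenated (s, a, r, s').
        _, h = self.gru(transitions)
        h = h.squeeze(0)
        means = self.mean_head(h).view(-1, self.num_components, self.latent_dim)
        log_stds = self.log_std_head(h).view(-1, self.num_components, self.latent_dim)
        logits = self.logit_head(h)
        return means, log_stds, logits

    def sample(self, transitions):
        # Sample a latent task variable: pick a mixture component, then reparameterize.
        means, log_stds, logits = self(transitions)
        comp = torch.distributions.Categorical(logits=logits).sample()
        idx = comp.view(-1, 1, 1).expand(-1, 1, self.latent_dim)
        mean = means.gather(1, idx).squeeze(1)
        std = log_stds.gather(1, idx).squeeze(1).exp()
        return mean + std * torch.randn_like(std)


class RewardDecoder(nn.Module):
    """Predict the reward from (state, action, z); the reconstruction error
    is the unsupervised signal that trains the encoder."""

    def __init__(self, state_dim, action_dim, latent_dim, hidden_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim + latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, 1),
        )

    def forward(self, state, action, z):
        return self.net(torch.cat([state, action, z], dim=-1))


if __name__ == "__main__":
    # Toy shapes only; a real setup would use replayed transitions grouped per task.
    state_dim, action_dim = 17, 6
    transition_dim = state_dim + action_dim + 1 + state_dim  # (s, a, r, s')
    encoder = GmmTaskEncoder(transition_dim)
    decoder = RewardDecoder(state_dim, action_dim, encoder.latent_dim)

    transitions = torch.randn(32, 20, transition_dim)
    state, action = torch.randn(32, state_dim), torch.randn(32, action_dim)
    reward = torch.randn(32, 1)

    z = encoder.sample(transitions)
    recon_loss = F.mse_loss(decoder(state, action, z), reward)
    recon_loss.backward()  # gradients reach only encoder and decoder,
                           # keeping task inference decoupled from policy training

The sketch only illustrates the decoupling idea: the policy would consume z as an additional input and be trained with its own RL objective, while the encoder-decoder pair is updated solely from the reconstruction loss shown above.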