Reinforcement learning (RL) algorithms are typically limited to learning a single solution to a specified task, even though there often exist diverse solutions to a given task. Compared with learning a single solution, learning a set of diverse solutions is beneficial because diverse solutions enable robust few-shot adaptation and allow the user to select a preferred solution. Although previous studies have shown that diverse behaviors can be modeled with a policy conditioned on latent variables, an approach for modeling an infinite set of diverse solutions with continuous latent variables has not been investigated. In this study, we propose an RL method that can learn infinitely many solutions by training a policy conditioned on a continuous or discrete low-dimensional latent variable. Through continuous control tasks, we demonstrate that our method can learn diverse solutions in a data-efficient manner and that the solutions can be used for few-shot adaptation to solve unseen tasks.
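To make the latent-conditioned policy concrete, the sketch below shows one minimal way such a policy could be structured: the state is concatenated with a low-dimensional latent variable z, so a single network indexes a family of behaviors by z. This is an illustrative PyTorch sketch, not the authors' implementation; the class name, network sizes, action bounds, and the uniform latent prior are all assumptions.

```python
# Minimal sketch (assumed, not the paper's code) of a policy conditioned on a
# low-dimensional latent variable z.
import torch
import torch.nn as nn

class LatentConditionedPolicy(nn.Module):
    def __init__(self, state_dim, action_dim, latent_dim=2, hidden_dim=256):
        super().__init__()
        # The latent z is concatenated with the state, so one network
        # represents a continuum (or discrete set) of behaviors indexed by z.
        self.net = nn.Sequential(
            nn.Linear(state_dim + latent_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim),
            nn.Tanh(),  # actions assumed to lie in [-1, 1]
        )

    def forward(self, state, z):
        return self.net(torch.cat([state, z], dim=-1))

# Sampling a continuous latent selects one of infinitely many behaviors;
# at adaptation time, z can be searched over to fit an unseen task.
policy = LatentConditionedPolicy(state_dim=17, action_dim=6, latent_dim=2)
state = torch.randn(1, 17)
z = torch.rand(1, 2) * 2 - 1  # z drawn uniformly from [-1, 1]^2 (assumption)
action = policy(state, z)
```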