It is important for an agent to learn a widely applicable, general-purpose policy that can achieve diverse goals, including images and text descriptions. For such perceptually-specific goals, a frontier of deep reinforcement learning research is to learn goal-conditioned policies without hand-crafted rewards. To learn this kind of policy, recent works typically use as the reward the non-parametric distance to a given goal in an explicit embedding space. From a different viewpoint, we propose a novel unsupervised learning approach named goal-conditioned policy with intrinsic motivation (GPIM), which jointly learns both an abstract-level policy and a goal-conditioned policy. The abstract-level policy is conditioned on a latent variable to optimize a discriminator and discovers diverse states, which are further rendered into perceptually-specific goals for the goal-conditioned policy. The learned discriminator serves as an intrinsic reward function for the goal-conditioned policy to imitate the trajectory induced by the abstract-level policy. Experiments on various robotic tasks demonstrate the effectiveness and efficiency of our proposed GPIM method, which substantially outperforms prior techniques.
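To make the two-policy structure described above concrete, the following is a minimal structural sketch (not the authors' implementation) of a GPIM-style training loop, assuming a DIAYN-style discriminator log q(z|s) as the intrinsic reward. All names here (Discriminator, abstract_policy, goal_policy, step, render_goal) are hypothetical placeholders for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)
N_SKILLS, STATE_DIM, HORIZON = 8, 4, 20

class Discriminator:
    """Predicts which latent z produced a state; log q(z|s) serves as intrinsic reward."""
    def __init__(self):
        self.w = rng.normal(size=(STATE_DIM, N_SKILLS)) * 0.01
    def log_prob(self, state, z):
        logits = state @ self.w
        return (logits - np.log(np.exp(logits).sum()))[z]
    def update(self, state, z, lr=1e-2):
        # one cross-entropy gradient step on the pair (state, z)
        logits = state @ self.w
        probs = np.exp(logits - logits.max()); probs /= probs.sum()
        grad = np.outer(state, probs)
        grad[:, z] -= state
        self.w -= lr * grad

def abstract_policy(state, z):   # placeholder latent-conditioned policy
    return rng.normal(size=STATE_DIM) * 0.1

def goal_policy(state, goal):    # placeholder goal-conditioned policy
    return rng.normal(size=STATE_DIM) * 0.1

def step(state, action):         # placeholder dynamics
    return state + action

def render_goal(state):          # render a visited state into a perceptual goal
    return state.copy()          # identity here; e.g. an image renderer in practice

disc = Discriminator()
for episode in range(100):
    z = rng.integers(N_SKILLS)
    # 1) abstract-level policy explores under latent z, rewarded for
    #    visiting states the discriminator can attribute to z
    s, traj = np.zeros(STATE_DIM), []
    for _ in range(HORIZON):
        s = step(s, abstract_policy(s, z))
        traj.append(s)
        r_abstract = disc.log_prob(s, z) - np.log(1.0 / N_SKILLS)
        disc.update(s, z)
        # update of the abstract-level policy with r_abstract omitted
    # 2) a visited state is rendered into a perceptually-specific goal; the
    #    goal-conditioned policy imitates the abstract trajectory, rewarded
    #    by the same learned discriminator
    goal = render_goal(traj[-1])
    s = np.zeros(STATE_DIM)
    for _ in range(HORIZON):
        s = step(s, goal_policy(s, goal))
        r_goal = disc.log_prob(s, z)
        # update of the goal-conditioned policy with r_goal omitted
```

The key design point this sketch tries to convey is that the same discriminator is reused: it is optimized against the abstract-level policy's state visitation and then repurposed as the intrinsic reward that drives the goal-conditioned policy toward the rendered goal.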