Designing agents capable of autonomously learning a wide range of skills is critical to increasing the scope of reinforcement learning: it both increases the diversity of learned skills and reduces the burden of manually designing a reward function for each skill. Self-supervised agents that set their own goals and try to maximize the diversity of those goals have shown great promise towards this end. However, a known limitation of agents that try to maximize the diversity of sampled goals is that they tend to be attracted to noise or, more generally, to parts of the environment that cannot be controlled (distractors). When agents have access to predefined goal features or expert knowledge, Absolute Learning Progress (ALP) provides a way to distinguish between regions that can be controlled and those that cannot. However, those methods often fall short when the agents are provided only with raw sensory inputs such as images. In this work, we extend those concepts to unsupervised image-based goal exploration. We propose a framework that allows agents to autonomously identify and ignore noisy, distracting regions while searching for novelty in the learnable regions, both improving overall performance and avoiding catastrophic forgetting. Our framework can be combined with any state-of-the-art novelty-seeking goal exploration approach. We construct a rich 3D image-based environment with distractors. Experiments on this environment show that agents using our framework successfully identify interesting regions of the environment, resulting in drastically improved performance. The source code is available at https://sites.google.com/view/grimgep.
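To make the role of ALP more concrete, the following is a minimal sketch, not the framework proposed in this work. The abstract does not specify how ALP is computed, so the sketch assumes a common formulation: the absolute difference between the mean error over an older window and the mean error over a more recent window, with regions then sampled in proportion to their ALP. The function names, the window size, and the toy error histories are all hypothetical.

```python
from statistics import mean


def absolute_learning_progress(errors, window=5):
    """Assumed ALP: |mean(older window) - mean(recent window)| of an error history."""
    if len(errors) < 2 * window:
        return 0.0  # not enough history to compare two windows
    older = errors[-2 * window:-window]
    recent = errors[-window:]
    return abs(mean(older) - mean(recent))


def region_sampling_probs(histories, window=5, eps=1e-8):
    """Sample regions in proportion to ALP; fall back to uniform if nothing changes."""
    alps = {name: absolute_learning_progress(h, window) for name, h in histories.items()}
    total = sum(alps.values())
    if total < eps:
        return {name: 1.0 / len(alps) for name in alps}
    return {name: alp / total for name, alp in alps.items()}


# Hypothetical error histories (illustrative values, not experimental data).
histories = {
    # Error decreases steadily: the agent is learning to control this region.
    "learnable": [1.0, 0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1],
    # Error stays high and fluctuates without a trend: an uncontrollable distractor.
    "distractor": [0.9, 1.1, 0.9, 1.1, 1.0, 1.1, 0.9, 1.1, 0.9, 1.0],
}

probs = region_sampling_probs(histories)
print(probs)
```

Under these assumptions, the distractor's error fluctuates but shows no trend, so its ALP is close to zero and almost all of the sampling probability goes to the learnable region. A purely novelty-driven criterion would instead keep being drawn to the noisy region.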