Unsupervised reinforcement learning (RL) studies how to leverage environment statistics to learn useful behaviors without the cost of reward engineering. However, a central challenge in unsupervised RL is to extract behaviors that meaningfully affect the world and cover the range of possible outcomes, without getting distracted by inherently unpredictable, uncontrollable, and stochastic elements in the environment. To this end, we propose an unsupervised RL method designed for high-dimensional, stochastic environments based on an adversarial game between two policies (which we call Explore and Control) controlling a single body and competing over the amount of observation entropy the agent experiences. The Explore agent seeks out states that maximally surprise the Control agent, which in turn aims to minimize surprise, and thereby manipulate the environment to return to familiar and predictable states. The competition between these two policies drives them to seek out increasingly surprising parts of the environment while learning to gain mastery over them. We show formally that the resulting algorithm maximizes coverage of the underlying state in block MDPs with stochastic observations, providing theoretical backing to our hypothesis that this procedure avoids uncontrollable and stochastic distractions. Our experiments further demonstrate that Adversarial Surprise leads to the emergence of complex and meaningful skills, and outperforms state-of-the-art unsupervised reinforcement learning methods in terms of both exploration and zero-shot transfer to downstream tasks.
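To make the two-player game described above concrete, here is a minimal sketch, not the paper's implementation: it assumes a discrete observation space, a simple count-based density model standing in for the learned observation model, and a gym-style environment interface. Names such as ObservationDensity, surprise_reward, and run_episode are hypothetical illustrations of the alternating Explore/Control reward structure.

```python
import numpy as np

class ObservationDensity:
    """Count-based density model over discrete observations (a hypothetical
    stand-in for a learned observation model). Surprise is measured against it."""
    def __init__(self, n_obs, alpha=1.0):
        self.counts = np.full(n_obs, alpha)  # Dirichlet-style smoothing

    def update(self, obs):
        self.counts[obs] += 1.0

    def log_prob(self, obs):
        return np.log(self.counts[obs] / self.counts.sum())

def surprise_reward(density, obs):
    """Per-step surprise: negative log-likelihood of the observation.
    Explore is rewarded by +surprise, Control by -surprise, so the two
    policies play a zero-sum game over observation entropy."""
    return -density.log_prob(obs)

def run_episode(env, explore_policy, control_policy, density,
                k_explore=64, k_control=64):
    """One schematic episode: Explore acts first, then Control takes over.
    Assumes a gym-style env with discrete observations and callable policies."""
    obs = env.reset()
    explore_return, control_return = 0.0, 0.0
    for t in range(k_explore + k_control):
        policy = explore_policy if t < k_explore else control_policy
        action = policy(obs)
        obs, _, done, _ = env.step(action)
        r = surprise_reward(density, obs)
        if t < k_explore:
            explore_return += r    # Explore seeks surprising observations
        else:
            control_return -= r    # Control steers back to familiar, predictable ones
        density.update(obs)
        if done:
            break
    return explore_return, control_return
```

Under this scheme, stochastic but uncontrollable distractors stay surprising for Control, which pushes it to act on the parts of the environment it can actually make predictable, while Explore keeps expanding the frontier of surprising states.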