Unsupervised reinforcement learning aims to learn a generalist policy in a reward-free manner for fast adaptation to downstream tasks. Most existing methods provide an intrinsic reward based on surprise: maximizing or minimizing surprise drives the agent to either explore or gain control over its environment. However, both strategies rely on a strong assumption: the entropy of the environment's dynamics is either high or low. This assumption may not always hold in real-world scenarios, where the entropy of the environment's dynamics may be unknown. Hence, choosing between the two objectives is a dilemma. We propose a novel yet simple mixture of policies to address this concern, allowing us to optimize an objective that simultaneously maximizes and minimizes the surprise. Concretely, we train one mixture component whose objective is to maximize the surprise and another whose objective is to minimize the surprise. Hence, our method makes no assumption about the entropy of the environment's dynamics. We call our method a $\textbf{M}\text{ixture }\textbf{O}\text{f }\textbf{S}\text{urprise}\textbf{S}$ (MOSS) for unsupervised reinforcement learning. Experimental results show that our simple method achieves state-of-the-art performance on the URLB benchmark, outperforming previous pure surprise maximization-based objectives. Our code is available at: https://github.com/LeapLabTHU/MOSS.
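The core idea above can be sketched in a few lines: a single intrinsic reward whose sign flips depending on which mixture component is acting. This is an illustrative sketch only, not the paper's implementation; the functions `surprise` and `intrinsic_reward` and the `transition_prob` input are hypothetical stand-ins (in practice, surprise would come from a learned dynamics or density model).

```python
import numpy as np

def surprise(transition_prob: float) -> float:
    """Surprise of an observed transition: the negative log-likelihood
    under the agent's (here hypothetical) dynamics model."""
    return -np.log(transition_prob)

def intrinsic_reward(transition_prob: float, mode: int) -> float:
    """Mixture-of-surprises intrinsic reward.

    mode = +1: surprise-maximizing component (exploration), reward = +surprise
    mode = -1: surprise-minimizing component (control),     reward = -surprise
    """
    return mode * surprise(transition_prob)

rng = np.random.default_rng(0)
# At the start of each episode, sample which mixture component controls
# the agent; both components share the same underlying surprise signal.
mode = rng.choice([+1, -1])
reward = intrinsic_reward(0.1, mode)
```

Because both components optimize the same quantity with opposite signs, neither a high- nor a low-entropy environment is assumed: whichever component is sampled, its reward is well-defined from the same surprise estimate.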