We study offline meta-reinforcement learning, a practical reinforcement learning paradigm that learns from offline data to adapt to new tasks. The distribution of offline data is determined jointly by the behavior policy and the task. Existing offline meta-reinforcement learning algorithms cannot distinguish these factors, making task representations unstable to changes in the behavior policy. To address this problem, we propose a contrastive learning framework for task representations that are robust to the distribution mismatch of behavior policies between training and test. We design a bi-level encoder structure, use mutual information maximization to formalize task representation learning, derive a contrastive learning objective, and introduce several approaches to approximate the true distribution of negative pairs. Experiments on a variety of offline meta-reinforcement learning benchmarks demonstrate the advantages of our method over prior methods, especially in generalization to out-of-distribution behavior policies. The code is available at https://github.com/PKU-AI-Edge/CORRO.
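As an illustrative sketch (not the paper's exact formulation), an InfoNCE-style contrastive objective that lower-bounds the mutual information between a task representation $z$ and a transition $\tau = (s, a, r, s')$ could take the following form, where $\tau^{+}$, $\tau^{-}_{j}$, $S(\cdot,\cdot)$, and $K$ are assumed notation introduced here for illustration:

\[
\mathcal{L}_{\text{contrastive}} = -\,\mathbb{E}\!\left[ \log \frac{\exp\big(S(z, \tau^{+})\big)}{\exp\big(S(z, \tau^{+})\big) + \sum_{j=1}^{K} \exp\big(S(z, \tau^{-}_{j})\big)} \right],
\]

where $\tau^{+}$ is a transition drawn from the same task as $z$ (positive pair), $\tau^{-}_{j}$ are negative transitions whose distribution must be approximated, and $S$ is a learned score function. Minimizing this loss maximizes a lower bound on the mutual information, which is one standard way such a contrastive objective can be derived.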