We study the offline meta-reinforcement learning (OMRL) problem, a paradigm that enables reinforcement learning (RL) algorithms to quickly adapt to unseen tasks without any interaction with the environment, making RL truly practical in many real-world applications. This problem is still not fully understood, and two major challenges need to be addressed. First, offline RL usually suffers from bootstrapping errors on out-of-distribution state-action pairs, which lead to divergence of the value function. Second, meta-RL requires efficient and robust task inference learned jointly with the control policy. In this work, we enforce behavior regularization on the learned policy as a general approach to offline RL, combined with a deterministic context encoder for efficient task inference. We propose a novel negative-power distance metric on a bounded context embedding space, whose gradient propagation is detached from the Bellman backup. We provide analysis and insight showing that some simple design choices can yield substantial improvements over recent approaches that combine meta-RL with distance metric learning. To the best of our knowledge, our method is the first model-free and end-to-end OMRL algorithm; it is computationally efficient and demonstrated to outperform prior algorithms on several meta-RL benchmarks.
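For concreteness, the following is a minimal PyTorch sketch of how the pieces named in the abstract might fit together: a deterministic context encoder with a bounded embedding space, a negative-power (inverse-power) distance term, and gradient propagation detached from the Bellman backup. The class and function names, the tanh squashing, the exponent n, and the detach-based decoupling are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class DeterministicContextEncoder(nn.Module):
    """Hypothetical encoder mapping a batch of context transitions to a task embedding."""

    def __init__(self, input_dim: int, embed_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, embed_dim),
        )

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        # context: (batch, num_transitions, input_dim).
        # Average over transitions, then squash with tanh so the embedding
        # lies in the bounded set (-1, 1)^embed_dim.
        return torch.tanh(self.net(context).mean(dim=1))


def inverse_power_distance(z1: torch.Tensor, z2: torch.Tensor,
                           n: float = 2.0, eps: float = 1e-3) -> torch.Tensor:
    """Illustrative negative-power term ||z1 - z2||^{-n} (with eps for stability).

    The term is large when two embeddings coincide and decays as they separate,
    so minimizing it for embeddings drawn from *different* tasks pushes them apart.
    """
    return 1.0 / (torch.norm(z1 - z2, dim=-1).pow(n) + eps)


# One possible way to detach the metric-learning gradients from the Bellman
# backup (an assumption about the implementation): pass a detached copy of the
# task embedding to the critic, so TD-error gradients never reach the encoder.
#
#   z = encoder(context)
#   q_value = critic(state, action, z.detach())
```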