We study the offline meta-reinforcement learning (OMRL) problem, a paradigm that enables reinforcement learning (RL) algorithms to quickly adapt to unseen tasks without any interaction with the environment, making RL truly practical in many real-world applications. This problem is still not fully understood, and two major challenges need to be addressed. First, offline RL usually suffers from bootstrapping errors on out-of-distribution state-action pairs, which lead to divergence of the value function. Second, meta-RL requires efficient and robust task inference learned jointly with the control policy. In this work, we enforce behavior regularization on the learned policy as a general approach to offline RL, combined with a deterministic context encoder for efficient task inference. We propose a novel negative-power distance metric on a bounded context embedding space, whose gradient propagation is detached from the Bellman backup. We provide analysis and insight showing that some simple design choices can yield substantial improvements over recent approaches that combine meta-RL with distance metric learning. To the best of our knowledge, our method is the first model-free and end-to-end OMRL algorithm; it is computationally efficient and demonstrated to outperform prior algorithms on several meta-RL benchmarks.
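For concreteness, the following is a minimal PyTorch sketch of how the pieces named in the abstract might fit together: a deterministic context encoder with a bounded embedding space, a negative-power (inverse-power) distance term, and gradient propagation detached from the Bellman backup. The class and function names, the tanh squashing, the exponent n, and the detach-based decoupling are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn


class DeterministicContextEncoder(nn.Module):
    """Hypothetical encoder mapping a batch of context transitions to a task embedding."""

    def __init__(self, input_dim: int, embed_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, embed_dim),
        )

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        # context: (batch, num_transitions, input_dim).
        # Average over transitions, then squash with tanh so the embedding
        # lies in the bounded set (-1, 1)^embed_dim.
        return torch.tanh(self.net(context).mean(dim=1))


def inverse_power_distance(z1: torch.Tensor, z2: torch.Tensor,
                           n: float = 2.0, eps: float = 1e-3) -> torch.Tensor:
    """Illustrative negative-power term ||z1 - z2||^{-n} (with eps for stability).

    The term is large when two embeddings coincide and decays as they separate,
    so minimizing it for embeddings drawn from *different* tasks pushes them apart.
    """
    return 1.0 / (torch.norm(z1 - z2, dim=-1).pow(n) + eps)


# One possible way to detach the metric-learning gradients from the Bellman
# backup (an assumption about the implementation): pass a detached copy of the
# task embedding to the critic, so TD-error gradients never reach the encoder.
#
#   z = encoder(context)
#   q_value = critic(state, action, z.detach())
```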