Offline Reinforcement Learning as One Big Sequence Modeling Problem

Reinforcement learning (RL) is typically concerned with estimating stationary policies or single-step models, leveraging the Markov property to factorize problems in time. However, we can also view RL as a generic sequence modeling problem, with the goal of producing a sequence of actions that leads to a sequence of high rewards. Viewed in this way, it is tempting to consider whether high-capacity sequence prediction models that work well in other domains, such as natural-language processing, can also provide effective solutions to the RL problem. To this end, we explore how RL can be tackled with the tools of sequence modeling, using a Transformer architecture to model distributions over trajectories and repurposing beam search as a planning algorithm. Framing RL as a sequence modeling problem simplifies a range of design decisions, allowing us to dispense with many of the components common in offline RL algorithms. We demonstrate the flexibility of this approach across long-horizon dynamics prediction, imitation learning, goal-conditioned RL, and offline RL. Further, we show that this approach can be combined with existing model-free algorithms to yield a state-of-the-art planner in sparse-reward, long-horizon tasks.
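
The idea of repurposing beam search as a planner over predicted trajectories can be illustrated with a minimal sketch. Everything below is an assumption for illustration, not the paper's implementation. Trajectories are flattened into alternating discrete action and reward tokens. `SequenceModel` is a hypothetical callable standing in for a trained Transformer. `toy_model` is a made-up two-action environment. Beams are ranked by summed predicted reward instead of log-likelihood, which is one plausible way reward-driven planning can differ from ordinary text decoding.

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical interface (an assumption, not the paper's code): a trained
# sequence model maps a token prefix to a distribution over the next token.
SequenceModel = Callable[[List[int]], Dict[int, float]]


def plan_with_beam_search(
    model: SequenceModel,
    prefix: List[int],
    horizon: int,
    beam_width: int = 4,
    expansions: int = 2,
) -> Tuple[List[int], float]:
    """Beam search over [action, reward, action, reward, ...] tokens.

    Beams are ranked by predicted cumulative reward rather than by
    log-likelihood. Candidate actions are still drawn from the model's
    most probable next tokens, which keeps plans close to the data.
    """
    # Each beam is (token sequence, cumulative predicted reward).
    beams: List[Tuple[List[int], float]] = [(list(prefix), 0.0)]
    for _ in range(horizon):
        candidates: List[Tuple[List[int], float]] = []
        for seq, total in beams:
            action_probs = model(seq)
            # Expand only the `expansions` most probable actions.
            top_actions = sorted(action_probs, key=action_probs.get, reverse=True)[:expansions]
            for action in top_actions:
                extended = seq + [action]
                reward_probs = model(extended)
                # Decode the reward greedily; in this sketch the token *is* the reward value.
                reward = max(reward_probs, key=reward_probs.get)
                candidates.append((extended + [reward], total + reward))
        # Keep the beams with the highest predicted return.
        candidates.sort(key=lambda beam: beam[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]


def toy_model(seq: List[int]) -> Dict[int, float]:
    """Hypothetical stand-in: even positions are actions, odd positions are rewards."""
    if len(seq) % 2 == 0:
        return {0: 0.6, 1: 0.4}  # action 0 is more likely under the "data"
    return {seq[-1]: 1.0}        # reward equals the action just taken


if __name__ == "__main__":
    plan, ret = plan_with_beam_search(toy_model, prefix=[], horizon=3)
    print(plan, ret)  # [1, 1, 1, 1, 1, 1] 3.0
```

With `beam_width=1, expansions=1` the search only follows the most likely action and returns the all-zero plan with return `0.0`. Widening the expansion lets reward-ranking recover the less likely but higher-return trajectory.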
