Reinforcement learning (RL) is typically concerned with estimating stationary policies or single-step models, leveraging the Markov property to factorize problems in time. However, we can also view RL as a generic sequence modeling problem, with the goal being to produce a sequence of actions that leads to a sequence of high rewards. Viewed in this way, it is tempting to consider whether high-capacity sequence prediction models that work well in other domains, such as natural language processing, can also provide effective solutions to the RL problem. To this end, we explore how RL can be tackled with the tools of sequence modeling, using a Transformer architecture to model distributions over trajectories and repurposing beam search as a planning algorithm. Framing RL as a sequence modeling problem simplifies a range of design decisions, allowing us to dispense with many of the components common in offline RL algorithms. We demonstrate the flexibility of this approach across long-horizon dynamics prediction, imitation learning, goal-conditioned RL, and offline RL. Further, we show that this approach can be combined with existing model-free algorithms to yield a state-of-the-art planner in sparse-reward, long-horizon tasks.
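To make the "beam search as planning" idea concrete, below is a minimal sketch of beam search over a stream of discretized trajectory tokens. The `next_token_logprobs` function is a hypothetical stand-in for the trained Transformer, and the scoring here uses cumulative log-probability only; the paper's planner additionally steers the search toward high predicted rewards, which this sketch does not reproduce.

```python
import numpy as np

# Hypothetical surrogate for a trained autoregressive Transformer over
# discretized trajectory tokens (states, actions, rewards flattened into
# one stream). A real model would return next-token log-probabilities
# conditioned on the token prefix.
VOCAB_SIZE = 16

def next_token_logprobs(prefix):
    """Toy surrogate: deterministic pseudo-random log-probs per prefix."""
    seed = hash(tuple(prefix)) % (2**32)
    logits = np.random.default_rng(seed).normal(size=VOCAB_SIZE)
    return logits - np.log(np.exp(logits).sum())  # log-softmax

def beam_search_plan(prefix, horizon, beam_width=4):
    """Repurpose beam search as a planner: at each step keep the
    `beam_width` highest-scoring continuations, and return the best
    full token sequence after `horizon` expansion steps."""
    beams = [(list(prefix), 0.0)]  # (token sequence, cumulative score)
    for _ in range(horizon):
        candidates = []
        for tokens, score in beams:
            logprobs = next_token_logprobs(tokens)
            # Expand each beam with its top-k next tokens.
            for tok in np.argsort(logprobs)[-beam_width:]:
                candidates.append((tokens + [int(tok)], score + logprobs[tok]))
        # Keep only the best `beam_width` candidates overall.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    return beams[0]  # best (sequence, score)

if __name__ == "__main__":
    plan, score = beam_search_plan(prefix=[1, 2, 3], horizon=5)
    print("planned token sequence:", plan, "score:", round(score, 3))
```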