Offline reinforcement learning (RL) aims at learning policies from previously collected static trajectory data without interacting with the real environment. Recent works provide a novel perspective by viewing offline RL as a generic sequence generation problem, adopting sequence models such as the Transformer architecture to model distributions over trajectories, and repurposing beam search as a planning algorithm. However, the training datasets used in general offline RL tasks are quite limited and often suffer from insufficient distribution coverage, which can be harmful to training sequence generation models yet has not drawn enough attention in previous works. In this paper, we propose a novel algorithm named Bootstrapped Transformer, which incorporates the idea of bootstrapping and leverages the learned model to self-generate more offline data to further boost sequence model training. We conduct extensive experiments on two offline RL benchmarks and demonstrate that our model can largely remedy the existing offline RL training limitations and outperform other strong baseline methods. We also analyze the generated pseudo data, and the revealed characteristics may shed some light on offline RL training. The code is available at https://seqml.github.io/bootorl.
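To make the bootstrapping idea concrete, the following is a minimal sketch of the training loop described above: a sequence model is trained on the static offline trajectories, used to self-generate pseudo trajectories, and then retrained on the augmented data. The `SequenceModel` class and its `fit`/`generate` interface are hypothetical placeholders, not the paper's Transformer implementation.

```python
# Minimal, illustrative sketch of a bootstrapped sequence-model training loop.
# SequenceModel is a stand-in for a trajectory Transformer; its internals here
# are toy placeholders so the sketch runs end to end.

import random
from typing import List, Sequence

Trajectory = List[float]  # placeholder: a trajectory as a flat list of tokens/values


class SequenceModel:
    """Stand-in for a sequence model over trajectories (e.g., a Transformer)."""

    def __init__(self) -> None:
        self._data: List[Trajectory] = []

    def fit(self, trajectories: Sequence[Trajectory]) -> None:
        # Real training would maximize the model's likelihood over trajectories.
        self._data = list(trajectories)

    def generate(self, n: int) -> List[Trajectory]:
        # Real generation would sample autoregressively from the learned model;
        # here we merely perturb stored trajectories to stand in for pseudo data.
        return [[x + random.gauss(0.0, 0.01) for x in random.choice(self._data)]
                for _ in range(n)]


def bootstrapped_training(offline: Sequence[Trajectory],
                          rounds: int = 3,
                          n_generated: int = 100) -> SequenceModel:
    """Alternate between training the model and augmenting the dataset
    with self-generated pseudo trajectories."""
    model = SequenceModel()
    data: List[Trajectory] = list(offline)
    for _ in range(rounds):
        model.fit(data)                        # train on real + pseudo data so far
        pseudo = model.generate(n_generated)   # self-generate pseudo trajectories
        data = list(offline) + pseudo          # keep the original data, add pseudo data
    return model


if __name__ == "__main__":
    toy_offline = [[random.random() for _ in range(10)] for _ in range(50)]
    bootstrapped_training(toy_offline)
```

This sketch only conveys the data-augmentation loop; how the pseudo trajectories are generated and filtered is the substance of the method and is detailed in the paper itself.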