Recent work has shown that offline reinforcement learning (RL) can be formulated as a sequence modeling problem (Chen et al., 2021; Janner et al., 2021) and solved via approaches similar to large-scale language modeling. However, any practical instantiation of RL also involves an online component, where policies pretrained on passive offline datasets are finetuned via task-specific interactions with the environment. We propose Online Decision Transformers (ODT), an RL algorithm based on sequence modeling that blends offline pretraining with online finetuning in a unified framework. Our framework uses sequence-level entropy regularizers in conjunction with autoregressive modeling objectives for sample-efficient exploration and finetuning. Empirically, we show that ODT is competitive with the state-of-the-art in absolute performance on the D4RL benchmark but shows much more significant gains during the finetuning procedure.
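As a rough illustration of the entropy-regularized objective described above (a sketch only, written in generic maximum-entropy notation; the symbols $K$, $\beta$, and $\lambda$ are standard placeholders rather than definitions taken from this abstract): given length-$K$ subsequences of states $\mathbf{s}$, returns-to-go $\mathbf{g}$, and actions $\mathbf{a}$, training a stochastic sequence policy $\pi_\theta$ can be posed as a constrained maximum-likelihood problem,
\[
\max_{\theta} \;\; \mathbb{E}_{(\mathbf{s},\mathbf{g},\mathbf{a})}\Big[\textstyle\sum_{t} \log \pi_\theta\big(a_t \mid \mathbf{s}_{\le t}, \mathbf{g}_{\le t}, \mathbf{a}_{< t}\big)\Big]
\quad \text{subject to} \quad H[\pi_\theta] \ge \beta ,
\]
whose Lagrangian relaxation introduces a dual variable (temperature) $\lambda \ge 0$ trading off likelihood against exploration:
\[
\min_{\lambda \ge 0} \; \max_{\theta} \;\; \mathbb{E}\Big[\textstyle\sum_{t} \log \pi_\theta\big(a_t \mid \mathbf{s}_{\le t}, \mathbf{g}_{\le t}, \mathbf{a}_{< t}\big)\Big] \;+\; \lambda\big(H[\pi_\theta] - \beta\big).
\]
This is only meant to convey the general shape of combining an autoregressive modeling objective with a sequence-level entropy constraint; the precise constraint and optimization procedure are specified in the paper's method section.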