How to extract as much learning signal as possible from each trajectory has been a key problem in reinforcement learning (RL), where sample inefficiency poses serious challenges for practical applications. Recent works have shown that using expressive policy function approximators and conditioning on future trajectory information -- such as future states in hindsight experience replay or returns-to-go in Decision Transformer (DT) -- enables efficient learning of multi-task policies, where at times online RL is fully replaced by offline behavioral cloning, e.g., sequence modeling. We demonstrate that all of these approaches perform hindsight information matching (HIM) -- training policies that can output the rest of the trajectory so that it matches some statistics of future state information. We present Generalized Decision Transformer (GDT) for solving any HIM problem, and show how different choices of the feature function and the anti-causal aggregator not only recover DT as a special case, but also lead to novel Categorical DT (CDT) and Bi-directional DT (BDT) for matching different statistics of the future. To evaluate CDT and BDT, we define offline multi-task state-marginal matching (SMM) and imitation learning (IL) as two generic HIM problems, propose a Wasserstein distance loss as a metric for both, and empirically study them on MuJoCo continuous control benchmarks. CDT, which simply replaces anti-causal summation with anti-causal binning in DT, enables the first effective offline multi-task SMM algorithm that generalizes well to unseen and even synthetic multi-modal state-feature distributions. BDT, which uses an anti-causal second transformer as the aggregator, can learn to model any statistics of the future and outperforms DT variants in offline multi-task IL. Our generalized formulations of HIM and GDT greatly expand the role of powerful sequence modeling architectures in modern RL.
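As a concrete illustration of the two conditioning signals contrasted above, the sketch below (a minimal NumPy example, not from the paper) computes DT-style returns-to-go via anti-causal summation and a CDT-style anti-causal binning of future state features; the feature function `phi`, the bin edges, and the toy trajectory are hypothetical choices made only for illustration.

```python
import numpy as np

def returns_to_go(rewards: np.ndarray) -> np.ndarray:
    """Anti-causal summation: R_t = sum over t' >= t of r_{t'} (DT's conditioning signal)."""
    return np.cumsum(rewards[::-1])[::-1]

def anticausal_binning(states: np.ndarray, phi, bin_edges: np.ndarray) -> np.ndarray:
    """Anti-causal binning: at each step t, a normalized histogram of the future
    state features phi(s_{t'}) for t' >= t (a CDT-style conditioning signal)."""
    feats = np.array([phi(s) for s in states])   # one scalar feature per state (assumed)
    T, K = len(feats), len(bin_edges) - 1
    hists = np.zeros((T, K))
    for t in range(T):
        counts, _ = np.histogram(feats[t:], bins=bin_edges)
        hists[t] = counts / counts.sum()         # distribution over future features
    return hists

# Toy usage: a 5-step trajectory with scalar states and rewards.
states = np.array([0.1, 0.4, 0.8, 0.3, 0.9])
rewards = np.array([1.0, 0.0, 0.5, 0.5, 1.0])
print(returns_to_go(rewards))                    # [3.  2.  2.  1.5 1. ]
print(anticausal_binning(states, phi=lambda s: s,
                         bin_edges=np.linspace(0.0, 1.0, 5)))
```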