Recent works have shown that tackling offline reinforcement learning (RL) with a conditional policy produces promising results. The Decision Transformer (DT) combines the conditional policy approach with a transformer architecture, showing competitive performance on several benchmarks. However, DT lacks stitching ability -- one of the critical abilities for offline RL to learn an optimal policy from sub-optimal trajectories. This issue becomes particularly significant when the offline dataset contains only sub-optimal trajectories. On the other hand, conventional RL approaches based on Dynamic Programming (such as Q-learning) do not share this limitation; however, they suffer from unstable learning behaviour, especially when they rely on function approximation in an off-policy learning setting. In this paper, we propose the Q-learning Decision Transformer (QDT), which addresses the shortcomings of DT by leveraging the benefits of Dynamic Programming (Q-learning). QDT uses the Dynamic Programming results to relabel the return-to-go in the training data and then trains DT on the relabelled data. Our approach exploits the benefits of these two approaches, each compensating for the other's shortcomings, to achieve better performance. We demonstrate this empirically in both simple toy environments and the more complex D4RL benchmark, observing competitive performance gains.
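To make the relabelling idea concrete, the following is a minimal sketch, not the paper's exact procedure: it assumes a hypothetical pre-trained value estimator `value_fn` (e.g. derived from an offline Q-learning method) and illustrates how the return-to-go of a logged trajectory could be rewritten with a backward sweep before it is fed to DT. The function name, the `gamma` parameter, and the specific max-based update rule are illustrative assumptions.

```python
import numpy as np

def relabel_returns_to_go(rewards, states, value_fn, gamma=1.0):
    """Illustrative return-to-go relabelling (a sketch, not QDT's exact rule).

    rewards:  sequence of scalar rewards r_0, ..., r_{T-1} from one trajectory
    states:   corresponding states s_0, ..., s_{T-1}
    value_fn: assumed learned value estimate V(s) from offline Q-learning
    gamma:    discount factor used for the backward accumulation
    """
    T = len(rewards)
    rtg = np.zeros(T)
    running = 0.0
    # Sweep backwards through the trajectory: at each step, keep the larger of
    # the accumulated (observed) return and the learned value estimate, so that
    # sub-optimal trajectory tails are replaced by the Dynamic-Programming
    # estimate of the achievable return.
    for t in reversed(range(T)):
        running = rewards[t] + gamma * running
        running = max(running, value_fn(states[t]))
        rtg[t] = running
    return rtg
```

The relabelled `rtg` values would then replace the original returns-to-go in the conditioning inputs when training the Decision Transformer.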