Recent works have shown that tackling offline reinforcement learning (RL) with a conditional policy produces promising results by converting the RL task into a supervised learning task. Decision Transformer (DT) combines the conditional policy approach with the Transformer architecture and shows competitive performance on several benchmarks. However, DT lacks stitching ability, one of the critical abilities for offline RL: learning the optimal policy from sub-optimal trajectories. The issue becomes significant when the offline dataset contains only sub-optimal trajectories. On the other hand, conventional RL approaches based on Dynamic Programming (such as Q-learning) do not suffer from this issue; however, they exhibit unstable learning behaviours, especially when function approximation is employed in an off-policy learning setting. In this paper, we propose the Q-learning Decision Transformer (QDT), which addresses the shortcomings of DT by leveraging the benefits of Dynamic Programming (Q-learning). QDT utilises the Q-learning results to relabel the return-to-go values in the training data, and the DT is then trained on the relabelled data. Our approach exploits the benefits of both methods so that they compensate for each other's shortcomings and achieve better performance. We demonstrate the issue of DT and the advantage of QDT in a simple environment, and we evaluate QDT on the more complex D4RL benchmark, showing good performance gains.
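To make the relabelling step concrete, the following is a minimal sketch of the return-to-go relabelling idea described above, not the paper's exact procedure. It assumes a value estimate `value_fn` (a hypothetical name for a state-value function derived from the offline Q-learning stage) and processes a single trajectory; the precise relabelling rule and the choice of value estimator are specified in the paper body.

```python
import numpy as np

def relabel_return_to_go(rewards, states, value_fn, gamma=1.0):
    """Sketch: relabel return-to-go (RTG) targets for one trajectory using a
    value estimate learned offline (e.g. from conservative Q-learning).

    rewards:  array of per-step rewards observed in the trajectory
    states:   array of states, aligned with rewards
    value_fn: callable state -> scalar value estimate (assumed interface)
    Returns relabelled RTG targets, propagated backwards so they remain
    consistent with the observed rewards.
    """
    T = len(rewards)
    rtg = np.zeros(T)
    future = 0.0  # relabelled RTG of the next timestep
    for t in reversed(range(T)):
        # RTG implied by the data: this step's reward plus the (already
        # relabelled) return-to-go of the remainder of the trajectory.
        observed = rewards[t] + gamma * future
        # If the learned value function indicates a better achievable return
        # from this state (e.g. by stitching onto another trajectory),
        # use that value as the conditioning target instead.
        rtg[t] = max(observed, value_fn(states[t]))
        future = rtg[t]
    return rtg
```

The relabelled RTG values would then replace the original return-to-go conditioning targets when training the Decision Transformer, which is how sub-optimal trajectories can be credited with the higher returns that Q-learning identifies as reachable.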