Offline goal-conditioned reinforcement learning (GCRL) can be challenging because agents tend to overfit to the given dataset. To generalize agents' skills beyond the dataset, we propose a goal-swapping procedure that generates additional trajectories. To alleviate noise and extrapolation errors, we present a general offline reinforcement learning method called deterministic Q-advantage policy gradient (DQAPG). In the experiments, DQAPG outperforms state-of-the-art goal-conditioned offline RL methods on a wide range of benchmark tasks, and goal swapping further improves the test results. Notably, the proposed method performs well on the challenging dexterous in-hand manipulation tasks on which prior methods fail.
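To illustrate the goal-swapping idea mentioned above, the following is a minimal sketch of how an offline GCRL dataset could be augmented by relabeling trajectories with goals drawn from other trajectories. The function name `goal_swap_augment` and the dataset layout (a list of trajectory dicts with a `goal` field) are assumptions for illustration and not the paper's exact procedure.

```python
import numpy as np

def goal_swap_augment(trajectories, rng=None):
    """Sketch: augment an offline GCRL dataset by relabeling each trajectory
    with a goal taken from another trajectory in the dataset (assumed format:
    each trajectory is a dict with 'observations', 'actions', and 'goal')."""
    rng = rng or np.random.default_rng()
    goals = [traj["goal"] for traj in trajectories]
    augmented = []
    for traj in trajectories:
        # Attach a goal sampled from a different trajectory, producing a new
        # goal-conditioned trajectory without any new environment interaction.
        swapped_goal = goals[rng.integers(len(goals))]
        new_traj = dict(traj)
        new_traj["goal"] = swapped_goal
        augmented.append(new_traj)
    # Train on the union of original and goal-swapped trajectories.
    return trajectories + augmented
```

The augmented trajectories are generally not successful demonstrations for their new goals, which is why the abstract pairs goal swapping with an offline RL method (DQAPG) designed to cope with the resulting noise and extrapolation errors.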