A promising paradigm for offline reinforcement learning (RL) is to constrain the learned policy to stay close to the dataset behaviors, known as policy-constraint offline RL. However, existing works rely heavily on the purity of the data and exhibit performance degradation, or even catastrophic failure, when learning from contaminated datasets that mix trajectories of diverse quality levels (e.g., expert, medium), even though such contaminated data logs are common in the real world. To mitigate this, we first introduce a gradient penalty over the learned value function to tackle exploding Q-values. We then relax the closeness constraint on non-optimal actions via critic-weighted constraint relaxation. Experimental results on a set of contaminated D4RL MuJoCo and Adroit datasets show that the proposed techniques effectively tame non-optimal trajectories for policy-constraint offline RL methods.
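As a rough illustration of the first idea, the sketch below shows one way a gradient penalty on the critic could be combined with a standard TD regression loss in PyTorch. The function name, the `q_net(states, actions)` interface, and the `penalty_coef` value are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def critic_loss_with_gradient_penalty(q_net, states, actions, target_q, penalty_coef=10.0):
    # Minimal sketch (assumed interfaces): standard TD regression plus a gradient
    # penalty on Q with respect to actions, discouraging steep Q-landscapes that
    # can blow up when the critic extrapolates beyond the dataset actions.
    actions = actions.clone().requires_grad_(True)
    q_values = q_net(states, actions)
    td_loss = F.mse_loss(q_values, target_q)

    # Penalize the squared gradient norm of Q w.r.t. the action input.
    grads = torch.autograd.grad(q_values.sum(), actions, create_graph=True)[0]
    grad_penalty = grads.pow(2).sum(dim=-1).mean()

    return td_loss + penalty_coef * grad_penalty
```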
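Similarly, a minimal sketch of critic-weighted constraint relaxation, assuming an exponentiated advantage-style weighting (the `temperature` parameter and the `policy`/`q_net` interfaces are assumptions, not necessarily the paper's exact scheme): the behavior-cloning constraint is downweighted for dataset actions the critic judges to be non-optimal.

```python
import torch

def critic_weighted_bc_loss(policy, q_net, states, actions, temperature=1.0):
    # Minimal sketch (assumed interfaces): reweight the behavior-cloning term by an
    # exponentiated advantage so non-optimal dataset actions impose a weaker
    # closeness constraint on the learned policy.
    with torch.no_grad():
        q_data = q_net(states, actions).view(-1)        # Q of dataset actions
        q_pi = q_net(states, policy(states)).view(-1)   # Q of current policy's actions
        advantage = q_data - q_pi
        weights = torch.exp(advantage / temperature).clamp(max=100.0)

    # Per-sample closeness (BC) error between policy actions and dataset actions.
    bc_error = (policy(states) - actions).pow(2).sum(dim=-1)
    return (weights * bc_error).mean()
```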