We study the problem of offline Imitation Learning (IL), where an agent aims to learn an optimal expert behavior policy without additional online environment interactions. Instead, the agent is provided with a supplementary offline dataset of suboptimal behaviors. Prior works that address this problem either require that expert data occupy the majority of the offline dataset, or need to learn a reward function and subsequently perform offline reinforcement learning (RL). In this paper, we aim to address the problem without the additional steps of reward learning and offline RL training, for the case when demonstrations contain a large proportion of suboptimal data. Built upon behavioral cloning (BC), we introduce an additional discriminator to distinguish expert from non-expert data. We propose a cooperation framework to boost the learning of both tasks. Based on this framework, we design a new IL algorithm in which the outputs of the discriminator serve as the weights of the BC loss. Experimental results show that our proposed algorithm achieves higher returns and faster training speed compared to baseline algorithms.
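As a minimal sketch of the general idea (our illustrative formulation, not the exact objective stated in the abstract), a discriminator-weighted BC loss can be written as

$$
\min_{\pi}\; \mathbb{E}_{(s,a)\sim\mathcal{D}}\!\left[\, w_{\psi}(s,a)\,\big(-\log \pi(a\mid s)\big)\right],
\qquad
w_{\psi}(s,a) \propto D_{\psi}(s,a),
$$

where $\mathcal{D}$ denotes the union of expert and supplementary suboptimal demonstrations, $\pi$ is the imitation policy, $D_{\psi}(s,a)\in(0,1)$ is the discriminator's estimate that $(s,a)$ comes from expert behavior, and the weight $w_{\psi}$ down-weights transitions judged non-expert. The symbols $w_{\psi}$, $D_{\psi}$, and $\mathcal{D}$ are illustrative notation introduced here and are not defined in the abstract itself.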