This work tackles a major challenge in offline inverse reinforcement learning (IRL): the reward extrapolation error, where the learned reward function may fail to explain the task correctly and misguide the agent in unseen environments due to the intrinsic covariate shift. Leveraging both expert data and lower-quality diverse data, we devise a principled algorithm, CLARE, that solves offline IRL efficiently by integrating "conservatism" into a learned reward function and utilizing an estimated dynamics model. Our theoretical analysis provides an upper bound on the return gap between the learned policy and the expert policy, based on which we characterize the impact of covariate shift by examining the subtle two-tier tradeoff between exploitation (of both expert and diverse data) and exploration (of the estimated dynamics model). We show that CLARE provably alleviates the reward extrapolation error by striking the right exploitation-exploration balance. Extensive experiments corroborate the significant performance gains of CLARE over existing state-of-the-art algorithms on MuJoCo continuous control tasks (especially with small offline datasets), and the learned reward is highly instructive for further learning.
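To make the notion of "conservatism" in reward learning concrete, the following is a minimal, hypothetical sketch (not the paper's exact formulation): a learned reward is penalized on state-action pairs that an estimated dynamics model is uncertain about, using ensemble disagreement as the uncertainty proxy. The `reward_net`, `models`, and their `predict` interfaces, as well as `penalty_weight`, are illustrative assumptions.

```python
import numpy as np

def ensemble_disagreement(models, state, action):
    """Uncertainty proxy: std. dev. of next-state predictions across a
    (hypothetical) ensemble of learned dynamics models."""
    preds = np.stack([m.predict(state, action) for m in models])
    return preds.std(axis=0).mean()

def conservative_reward(reward_net, models, state, action, penalty_weight=1.0):
    """Learned reward minus an uncertainty penalty, so poorly covered
    (out-of-distribution) state-action pairs receive lower reward."""
    r = reward_net.predict(state, action)
    penalty = penalty_weight * ensemble_disagreement(models, state, action)
    return r - penalty
```

The penalty term discourages the downstream policy from exploiting reward estimates in regions unsupported by the offline data, which is one common way to mitigate extrapolation error under covariate shift.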