With the advent of large datasets, offline reinforcement learning (RL) is a promising framework for learning good decision-making policies without the need to interact with the real environment. However, offline RL requires the dataset to be reward-annotated, which presents practical challenges when reward engineering is difficult or when obtaining reward annotations is labor-intensive. In this paper, we introduce Optimal Transport Reward labeling (OTR), an algorithm that assigns rewards to offline trajectories using only a few high-quality demonstrations. OTR's key idea is to use optimal transport to compute an optimal alignment between an unlabeled trajectory in the dataset and an expert demonstration, yielding a similarity measure that can be interpreted as a reward, which an offline RL algorithm can then use to learn a policy. OTR is easy to implement and computationally efficient. On D4RL benchmarks, we show that OTR with a single demonstration can consistently match the performance of offline RL with ground-truth rewards.
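To make the labeling step concrete, the sketch below computes per-step rewards for one unlabeled trajectory from a single expert demonstration via entropic optimal transport. The Euclidean state cost, uniform marginals, the hand-rolled Sinkhorn solver, and the plain negative-cost reward are illustrative assumptions, not the paper's exact design choices; any reward scaling or squashing used in practice is omitted.

```python
import numpy as np

def sinkhorn(a, b, cost, reg=0.1, n_iters=200):
    """Entropic-regularized optimal transport via Sinkhorn iterations.

    a, b:  marginal weights over the two trajectories (each sums to 1).
    cost:  pairwise cost matrix of shape (len(a), len(b)).
    Returns the transport plan with the same shape as `cost`.
    """
    K = np.exp(-cost / reg)              # Gibbs kernel
    u = np.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.T @ u)
        u = a / (K @ v)
    return u[:, None] * K * v[None, :]   # transport plan

def ot_rewards(traj_states, expert_states):
    """Label each step of an unlabeled trajectory with an OT-based reward.

    traj_states:   (T, d) array of states from an unlabeled offline trajectory.
    expert_states: (N, d) array of states from a single expert demonstration.
    """
    # Pairwise cost between unlabeled and expert states; the choice of
    # Euclidean distance (and any state preprocessing) is an assumption here.
    cost = np.linalg.norm(
        traj_states[:, None, :] - expert_states[None, :, :], axis=-1
    )
    T, N = cost.shape
    a = np.full(T, 1.0 / T)              # uniform weight over trajectory steps
    b = np.full(N, 1.0 / N)              # uniform weight over demo steps
    plan = sinkhorn(a, b, cost)
    # Per-step reward: negative transport cost attributed to that step, so
    # steps well aligned with the demonstration receive higher reward.
    return -(plan * cost).sum(axis=1)
```

The resulting per-step rewards can be written back into the offline dataset, after which any off-the-shelf offline RL algorithm can be trained on the relabeled data.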