The goal of inverse reinforcement learning (IRL) is to infer a reward function that explains the behavior of an agent performing a task. Most approaches assume that the demonstrated behavior is near-optimal. In many real-world scenarios, however, examples of truly optimal behavior are scarce, and it is desirable to effectively leverage sets of demonstrations of suboptimal or heterogeneous performance, which are easier to obtain. We propose an algorithm that learns a reward function from such demonstrations together with a weak supervision signal in the form of a distribution over rewards collected during the demonstrations (or, more generally, a distribution over cumulative discounted future rewards). We view such distributions, which we also refer to as optimality profiles, as summaries of the degree of optimality of the demonstrations that may, for example, reflect the opinion of a human expert. Given an optimality profile and a small amount of additional supervision, our algorithm fits a reward function, modeled as a neural network, by essentially minimizing the Wasserstein distance between the corresponding induced distribution and the optimality profile. We show that our method is capable of learning reward functions such that policies trained to optimize them outperform the demonstrations used for fitting the reward functions.
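To make the core idea concrete, the following is a minimal sketch (not the authors' implementation, and it omits the additional supervision mentioned above): a small neural reward network is fit so that the empirical distribution of cumulative discounted returns it induces on the demonstration trajectories matches a target optimality profile under the 1-D Wasserstein-1 distance. All names (`RewardNet`, `demo_trajs`, `target_profile`, `gamma`) and the sort-based W1 estimator are illustrative assumptions.

```python
import torch
import torch.nn as nn

class RewardNet(nn.Module):
    """Hypothetical MLP mapping a state vector to a scalar reward."""
    def __init__(self, state_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        return self.net(s).squeeze(-1)

def discounted_returns(rewards: torch.Tensor, gamma: float) -> torch.Tensor:
    """Cumulative discounted future reward G_t = sum_{k>=t} gamma^(k-t) r_k
    for every step of one trajectory; rewards has shape (T,)."""
    T = rewards.shape[0]
    discounts = gamma ** torch.arange(T, dtype=rewards.dtype)
    # reversed cumulative sum of gamma^k r_k, then undo the gamma^t factor
    g = torch.flip(torch.cumsum(torch.flip(rewards * discounts, [0]), 0), [0])
    return g / discounts

def wasserstein_1d(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """W1 distance between two 1-D empirical distributions with equal sample
    counts: mean absolute difference of the sorted samples (differentiable)."""
    return (torch.sort(a).values - torch.sort(b).values).abs().mean()

def fit_reward(demo_trajs, target_profile, state_dim, gamma=0.99, steps=1000):
    """demo_trajs: list of (T_i, state_dim) tensors of demonstration states.
    target_profile: 1-D tensor of return samples encoding the optimality profile."""
    reward_net = RewardNet(state_dim)
    opt = torch.optim.Adam(reward_net.parameters(), lr=1e-3)
    for _ in range(steps):
        # distribution of returns induced by the current reward on the demos
        returns = torch.cat([
            discounted_returns(reward_net(traj), gamma) for traj in demo_trajs
        ])
        # subsample so both empirical distributions have the same size
        idx = torch.randint(0, returns.shape[0], (target_profile.shape[0],))
        loss = wasserstein_1d(returns[idx], target_profile)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return reward_net
```

The fitted `reward_net` could then be handed to any standard RL algorithm to train a policy, which is how the abstract's claim of outperforming the demonstrations would be evaluated.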