This paper presents a deep Inverse Reinforcement Learning (IRL) framework that can learn an a priori unknown number of nonlinear reward functions from unlabeled experts' demonstrations. For this purpose, we employ tools from Dirichlet processes and propose an adaptive approach that simultaneously accounts for the complexity of the reward functions and for their unknown number. Using the conditional maximum entropy principle, we model the experts' multi-intention behavior as a mixture of latent intention distributions and derive two algorithms to estimate the parameters of the deep reward network, along with the number of experts' intentions, from unlabeled demonstrations. The proposed algorithms are evaluated on three benchmarks, two of which have been specifically extended in this study for multi-intention IRL, and compared against well-known baselines. Through several experiments, we demonstrate the advantages of our algorithms over existing approaches and the benefits of inferring the number of experts' intentions online rather than fixing it beforehand.
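As a minimal sketch of the model described above (the symbols $\tau$, $R_{\theta_k}$, $\pi_k$, $Z(\theta_k)$, and $\alpha$ are illustrative notation, not necessarily the paper's), the multi-intention behavior can be read as a Dirichlet-process mixture of maximum-entropy trajectory distributions:
\[
p(\tau) \;=\; \sum_{k=1}^{\infty} \pi_k \,\frac{\exp\!\big(R_{\theta_k}(\tau)\big)}{Z(\theta_k)}, \qquad \pi \sim \mathrm{GEM}(\alpha),
\]
where each $\theta_k$ parameterizes one deep reward network, $Z(\theta_k)$ is the corresponding partition function, and the stick-breaking weights $\pi$ allow the number of active intentions to be inferred from the demonstrations rather than fixed in advance.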