Multiple-Intent Inverse Reinforcement Learning (MI-IRL) seeks a reward function ensemble that rationalizes demonstrations of different but unlabelled intents. Within the popular expectation-maximization (EM) framework for learning probabilistic MI-IRL models, we present a warm-start strategy based on up-front clustering of the demonstrations in feature space. Our theoretical analysis shows that this warm-start solution produces a near-optimal reward ensemble, provided the behavior modes satisfy mild separation conditions. We also propose an MI-IRL performance metric that generalizes the popular Expected Value Difference measure to directly assess learned rewards against the ground-truth reward ensemble. Our metric elegantly addresses the difficulty of pairing up learned and ground-truth rewards via a min-cost flow formulation, and is efficiently computable. We also develop an MI-IRL benchmark problem that allows for more comprehensive algorithmic evaluations. On this problem, we find that our MI-IRL warm-start strategy helps avoid poor-quality local minima in the space of reward ensembles, resulting in a significant improvement in behavior clustering. Our extensive sensitivity analysis demonstrates that the quality of the learned reward ensembles improves under various settings, including cases where our theoretical assumptions do not necessarily hold. Finally, we demonstrate the effectiveness of our methods by discovering distinct driving styles in a large real-world dataset of driver GPS trajectories.
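For illustration only, the following is a minimal sketch of how a min-cost-flow (transportation) pairing between a learned reward ensemble and a ground-truth ensemble could be computed, assuming pairwise Expected Value Difference costs and mixture weights are already available; the function name `ensemble_evd` and all variable names are hypothetical and do not correspond to the paper's implementation.

```python
import numpy as np
from scipy.optimize import linprog


def ensemble_evd(cost, u, v):
    """Transportation-problem pairing of reward ensembles (illustrative sketch).

    cost[i, j] -- assumed pairwise Expected Value Difference between
                  ground-truth reward i and learned reward j
    u, v       -- mixture weights of the ground-truth and learned ensembles
                  (each sums to 1)
    Returns the cost of the optimal fractional pairing.
    """
    m, n = cost.shape
    c = np.asarray(cost, dtype=float).ravel()  # flow variables x[i, j], row-major

    A_eq, b_eq = [], []
    # Each ground-truth component i ships exactly u[i] units of mass.
    for i in range(m):
        row = np.zeros(m * n)
        row[i * n:(i + 1) * n] = 1.0
        A_eq.append(row)
        b_eq.append(u[i])
    # Each learned component j absorbs exactly v[j] units of mass.
    for j in range(n):
        col = np.zeros(m * n)
        col[j::n] = 1.0
        A_eq.append(col)
        b_eq.append(v[j])

    res = linprog(c, A_eq=np.array(A_eq), b_eq=np.array(b_eq),
                  bounds=(0, None), method="highs")
    return res.fun


# Toy usage: two ground-truth intents vs. two learned rewards.
cost = np.array([[0.1, 2.0],
                 [1.5, 0.2]])
print(ensemble_evd(cost, u=[0.6, 0.4], v=[0.5, 0.5]))
```

Framing the pairing as a transportation LP (rather than a one-to-one assignment) allows the learned and ground-truth ensembles to have different numbers of components and different mixture weights, with the optimal flow splitting mass across pairs as needed.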