We consider the problem of learning the behavioral preferences of an expert engaged in a task from noisy and partially-observable demonstrations. This is motivated by real-world applications such as a line robot learning from observing a human worker, where some observations are occluded by environmental objects that cannot be removed. Furthermore, robotic perception tends to be imperfect and noisy. Previous techniques for inverse reinforcement learning (IRL) either omit the missing portions of the trajectory or infer them as part of an expectation-maximization scheme, which tends to be slow and prone to local optima. We present a new method that generalizes the well-known Bayesian maximum-a-posteriori (MAP) IRL method by marginalizing over the occluded portions of the trajectory, and we further extend it with an observation model to account for perception noise. We show that this marginal MAP (MMAP) approach significantly improves on the previous IRL technique under occlusion, both in formative evaluations on a toy problem and in a summative evaluation on an onion-sorting line task performed by a robot.
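To make the objective concrete, the following is a minimal sketch of the marginal MAP idea; the notation ($Y$ for the observed portion of a demonstration, $Z$ for its occluded portion, $O$ for the noisy percept, $\theta$ for the reward parameters) is illustrative and not taken verbatim from the method. Whereas MAP IRL maximizes the posterior over complete trajectories, the MMAP variant sums out the occluded portion:
\[
\theta^{*} \;=\; \arg\max_{\theta}\; \Pr(\theta) \sum_{Z} \Pr(Y, Z \mid \theta).
\]
Under an additional observation model $\Pr(O \mid Y, Z)$ accounting for perception noise, the underlying trajectory is itself marginalized:
\[
\theta^{*} \;=\; \arg\max_{\theta}\; \Pr(\theta) \sum_{Y, Z} \Pr(O \mid Y, Z)\, \Pr(Y, Z \mid \theta).
\]
Note that, unlike expectation-maximization, this optimizes the marginalized posterior directly rather than alternating between inferring $Z$ and re-estimating $\theta$.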