We study the problem of inverse reinforcement learning (IRL), in which the learning agent recovers a reward function from expert demonstrations. Most existing IRL techniques make the often unrealistic assumption that the agent has access to full information about the environment. We remove this assumption by developing an algorithm for IRL in partially observable Markov decision processes (POMDPs). The algorithm addresses several limitations of existing techniques that do not account for the information asymmetry between the expert and the learner. First, it adopts causal entropy, rather than the entropy used in most existing IRL techniques, as the measure of the likelihood of the expert demonstrations, thereby avoiding a common source of algorithmic complexity. Second, it incorporates task specifications expressed in temporal logic into IRL. Such specifications may be interpreted as side information available to the learner a priori, in addition to the demonstrations, and may reduce the information asymmetry. Nevertheless, the resulting formulation is still nonconvex due to the intrinsic nonconvexity of the so-called forward problem in POMDPs, i.e., computing an optimal policy for a given reward function. We address this nonconvexity through sequential convex programming and introduce several extensions to solve the forward problem in a scalable manner. This scalability makes it possible to compute policies with memory, which incur additional computational cost but outperform memoryless policies. We demonstrate that, even with severely limited data, the algorithm learns reward functions and policies that satisfy the task and induce behavior similar to the expert's by leveraging the side information and incorporating memory into the policy.
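To make the objective concrete, the following is a minimal sketch of a maximum-causal-entropy IRL formulation with a temporal-logic side-information constraint; the symbols (observation-based policy $\pi$, discount factor $\gamma$, feature map $\phi$, specification $\varphi$, satisfaction threshold $\lambda$) are illustrative assumptions rather than the paper's exact notation:
\[
\max_{\pi}\;\; \mathbb{E}_{\pi}\Big[-\sum_{t\ge 0}\gamma^{t}\log \pi\big(a_{t}\mid o_{1:t},a_{1:t-1}\big)\Big]
\quad \text{s.t.} \quad
\mathbb{E}_{\pi}\Big[\sum_{t\ge 0}\gamma^{t}\phi(s_{t},a_{t})\Big]
=\hat{\mathbb{E}}_{\mathrm{expert}}\Big[\sum_{t\ge 0}\gamma^{t}\phi(s_{t},a_{t})\Big],
\qquad
\Pr_{\pi}\big(\text{trajectory}\models\varphi\big)\ \ge\ \lambda .
\]
The objective is the causal entropy of the observation-based policy, the equality constraint matches empirical expert feature expectations, and the final constraint encodes the temporal-logic task as side information; the nonconvexity mentioned above arises because these quantities depend on the policy through the POMDP dynamics, which is what the sequential convex programming scheme is meant to handle.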