We study the problem of inverse reinforcement learning (IRL), in which the learning agent recovers a reward function from expert demonstrations. Most existing IRL techniques make the often unrealistic assumption that the agent has access to full information about the environment. We remove this assumption by developing an algorithm for IRL in partially observable Markov decision processes (POMDPs), where an agent cannot directly observe the current state of the POMDP. The algorithm addresses several limitations of existing techniques that do not take the \emph{information asymmetry} between the expert and the agent into account. First, it adopts causal entropy, rather than the entropy used in most existing IRL techniques, as the measure of the likelihood of the expert demonstrations, thereby avoiding a common source of algorithmic complexity. Second, it incorporates task specifications expressed in temporal logic into IRL. Such specifications may be interpreted as side information available to the learner a priori, in addition to the demonstrations, and may reduce the information asymmetry between the expert and the agent. Nevertheless, the resulting formulation is still nonconvex due to the intrinsic nonconvexity of the so-called \emph{forward problem}, i.e., computing an optimal policy given a reward function, in POMDPs. We address this nonconvexity through sequential convex programming and introduce several extensions to solve the forward problem in a scalable manner. This scalability makes it possible to compute policies that incorporate memory, which incur additional computational cost but achieve higher performance than memoryless policies. We demonstrate that, even with severely limited data, the algorithm learns reward functions and policies that satisfy the task and induce behavior similar to the expert's by leveraging the side information and incorporating memory into the policy.
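For concreteness, a schematic form of the learning problem sketched above is given below, stated as a standard maximum-causal-entropy IRL problem with a temporal-logic side constraint; the symbols used here ($\phi$ for reward features, $\varphi$ for the task specification, $\lambda$ for the satisfaction threshold, $o_{1:t}$ for the observation history) are illustrative assumptions and need not match the notation used in the body of the paper:
\begin{align*}
  \max_{\pi}\quad & \mathbb{E}_{\pi}\!\left[-\sum_{t}\log \pi\!\left(a_t \mid o_{1:t}, a_{1:t-1}\right)\right]
    && \text{(causal entropy of the policy)}\\
  \text{s.t.}\quad & \mathbb{E}_{\pi}\!\left[\sum_{t}\phi(s_t,a_t)\right]
    = \widehat{\mathbb{E}}_{\mathrm{expert}}\!\left[\sum_{t}\phi(s_t,a_t)\right]
    && \text{(match expert feature expectations)}\\
  & \Pr\nolimits_{\pi}\!\left(\text{the induced path satisfies } \varphi\right) \ge \lambda
    && \text{(temporal-logic side information)}
\end{align*}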