Offline inverse reinforcement learning (offline IRL) aims to recover the structure of rewards and environment dynamics that underlie observed actions in a fixed, finite set of demonstrations from an expert agent. Accurate models of expertise in executing a task have applications in safety-sensitive domains such as clinical decision making and autonomous driving. However, the structure of an expert's preferences implicit in observed actions is closely linked to the expert's model of the environment dynamics (i.e., the ``world''). Thus, inaccurate models of the world obtained from finite data with limited coverage can compound the inaccuracy of the estimated rewards. To address this issue, we propose a bi-level optimization formulation of the estimation task, in which the upper level performs likelihood maximization based on a conservative model of the expert's policy (the lower level). The policy model is conservative in that it maximizes reward subject to a penalty that increases with the uncertainty of the estimated world model. We propose a new algorithmic framework to solve this bi-level optimization problem and provide statistical and computational guarantees of performance for the associated reward estimator. Finally, we demonstrate that the proposed algorithm outperforms state-of-the-art offline IRL and imitation learning benchmarks by a large margin on continuous control tasks in MuJoCo and the corresponding datasets in the D4RL benchmark.
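As a rough sketch of the bi-level structure described above (the notation here is illustrative, not the paper's own): let $\mathcal{D}$ be the demonstration set, $r_{\theta}$ a parameterized reward, $\widehat{P}$ the world model estimated from the offline data, $U(s,a)$ an uncertainty penalty associated with $\widehat{P}$, and $\lambda > 0$ a penalty weight. One way to write the formulation is
\begin{equation*}
\max_{\theta}\ \sum_{(s,a)\in\mathcal{D}} \log \pi^{*}_{\theta}(a \mid s)
\quad \text{s.t.} \quad
\pi^{*}_{\theta} \in \arg\max_{\pi}\ \mathbb{E}_{\pi,\,\widehat{P}}\!\left[\sum_{t} \gamma^{t}\big(r_{\theta}(s_t,a_t) - \lambda\, U(s_t,a_t)\big)\right],
\end{equation*}
so that the upper level fits the reward parameters by maximizing the likelihood of the demonstrations under the lower-level policy, while the lower level is made conservative by penalizing state-action pairs where the estimated world model is uncertain due to limited data coverage.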