Inverse Reinforcement Learning (IRL) is a powerful paradigm for inferring a reward function from expert demonstrations. Many IRL algorithms require a known transition model, sometimes even a known expert policy, or at least access to a generative model of the environment. However, these assumptions are too strong for many real-world applications, where the environment can only be accessed through sequential interaction. We propose a novel IRL algorithm, Active exploration for Inverse Reinforcement Learning (AceIRL), which actively explores an unknown environment and expert policy to quickly learn the expert's reward function and identify a good policy. AceIRL uses previous observations to construct confidence intervals that capture plausible reward functions and to find exploration policies that focus on the most informative regions of the environment. AceIRL is the first approach to active IRL with sample-complexity bounds that does not require a generative model of the environment. In the worst case, AceIRL matches the sample complexity of active IRL with a generative model. In addition, we establish a problem-dependent bound that relates the sample complexity of AceIRL to the suboptimality gap of a given IRL problem. We empirically evaluate AceIRL in simulations and find that it significantly outperforms naive exploration strategies.
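To make the idea of confidence-interval-driven exploration concrete, the following minimal Python sketch plans an exploration policy by running value iteration with a reward-confidence width as an intrinsic reward, so the agent is steered toward poorly-visited (most informative) regions. This is an illustrative toy under simplifying assumptions (a small tabular MDP with known transitions and a Hoeffding-style confidence width), not the AceIRL algorithm itself; all function and variable names here are hypothetical.

```python
import numpy as np

def uncertainty_driven_exploration_policy(P, visit_counts, gamma=0.9, delta=0.1, n_iter=200):
    """Plan an exploration policy that seeks out poorly-visited state-action pairs.

    P:            transition tensor of shape (S, A, S), assumed known for simplicity.
    visit_counts: array of shape (S, A) with the number of observations so far.
    Returns a deterministic policy (array of shape (S,)) that is greedy with
    respect to the current confidence width, treated as an intrinsic reward.
    """
    S, A, _ = P.shape
    # Hoeffding-style confidence width on the reward estimate: shrinks as
    # 1/sqrt(n) with the number of visits to each (s, a) pair.
    width = np.sqrt(np.log(2 * S * A / delta) / np.maximum(visit_counts, 1))
    # Value iteration with the confidence width as the reward, so the greedy
    # policy focuses on the most uncertain regions of the environment.
    Q = np.zeros((S, A))
    for _ in range(n_iter):
        V = Q.max(axis=1)
        Q = width + gamma * P @ V
    return Q.argmax(axis=1)

# Example: a tiny random MDP with 5 states and 2 actions.
rng = np.random.default_rng(0)
P = rng.dirichlet(np.ones(5), size=(5, 2))   # random transition probabilities
counts = rng.integers(0, 10, size=(5, 2))    # pretend visit counts so far
policy = uncertainty_driven_exploration_policy(P, counts)
print(policy)
```

In an active IRL loop, a policy like this would be replanned after each batch of interactions as the visit counts grow and the confidence intervals shrink.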