Inverse Reinforcement Learning (IRL) is a powerful paradigm for inferring a reward function from expert demonstrations. Many IRL algorithms require a known transition model and sometimes even a known expert policy, or they at least require access to a generative model. However, these assumptions are too strong for many real-world applications, where the environment can be accessed only through sequential interaction. We propose a novel IRL algorithm: Active exploration for Inverse Reinforcement Learning (AceIRL), which actively explores an unknown environment and expert policy to quickly learn the expert's reward function and identify a good policy. AceIRL uses previous observations to construct confidence intervals that capture plausible reward functions and find exploration policies that focus on the most informative regions of the environment. AceIRL is the first approach to active IRL with sample-complexity bounds that does not require a generative model of the environment. AceIRL matches the sample complexity of active IRL with a generative model in the worst case. Additionally, we establish a problem-dependent bound that relates the sample complexity of AceIRL to the suboptimality gap of a given IRL problem. We empirically evaluate AceIRL in simulations and find that it significantly outperforms more naive exploration strategies.
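To make the high-level description concrete, the following Python sketch illustrates one way a confidence-interval-driven exploration loop of this kind could look in a small tabular setting. The abstract gives no implementation details, so the `env`/`expert` interface, the Hoeffding-style interval widths, and the greedy uncertainty-directed policy are all illustrative assumptions rather than AceIRL's actual construction.

```python
# Minimal, hypothetical sketch of a confidence-interval-driven exploration loop.
# Everything here (interface, bound, policy) is an illustrative assumption,
# not the paper's algorithm.
import numpy as np

def active_irl_sketch(env, expert, n_states, n_actions, horizon,
                      n_episodes=100, delta=0.1):
    """Explore `env`, query `expert` feedback, and shrink per-state-action
    reward confidence intervals (illustrative only)."""
    visit_counts = np.ones((n_states, n_actions))   # avoid division by zero
    reward_sums = np.zeros((n_states, n_actions))   # accumulated expert feedback

    for _ in range(n_episodes):
        # Hoeffding-style confidence-interval widths: uncertainty shrinks
        # as state-action pairs are visited more often.
        widths = np.sqrt(np.log(2 * n_states * n_actions / delta)
                         / (2 * visit_counts))

        # Exploration policy: in each state, pick the action whose reward
        # estimate is currently most uncertain (a greedy stand-in for
        # planning toward the most informative regions).
        explore_policy = np.argmax(widths, axis=1)

        state = env.reset()
        for _ in range(horizon):
            action = explore_policy[state]
            next_state = env.step(action)           # assumed: returns next state
            expert_feedback = expert(state)         # assumed: scalar feedback signal
            visit_counts[state, action] += 1
            reward_sums[state, action] += expert_feedback
            state = next_state

    # Return the mean-reward estimate and the remaining interval widths.
    return reward_sums / visit_counts, widths
```

In the sketch, the per-state greedy rule is only a stand-in for the planning step the abstract alludes to: finding an exploration policy that concentrates visits on the regions of the environment where the plausible reward functions still disagree most.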