Reinforcement learning in complex environments is a challenging problem. In particular, the success of reinforcement learning algorithms depends on a well-designed reward function. Inverse reinforcement learning (IRL) addresses the problem of recovering a reward function from expert demonstrations. In this paper, we solve a hierarchical inverse reinforcement learning problem within the options framework, which allows us to exploit the intrinsic motivation underlying the expert demonstrations. A gradient method for parametrized options is used to deduce a defining equation for the Q-feature space, which in turn yields a reward feature space. Using a second-order optimality condition on the option parameters, an optimal reward function is selected. Experimental results in both discrete and continuous domains confirm that the recovered rewards solve the IRL problem with temporal abstraction and, in turn, are effective in accelerating transfer learning tasks. We also show that our method is robust to noise in the expert demonstrations.
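As background for the options framework referenced above, the following display gives the standard call-and-return option-value equations (as in the options and option-critic literature) together with a common linear reward parametrization used in IRL. The notation is illustrative rather than the paper's own: the parameters $\vartheta$, $\nu$, $\theta$ and the feature map $\phi$ are assumptions introduced here for exposition.
\[
Q_\Omega(s,\omega) = \sum_{a} \pi_{\omega,\vartheta}(a \mid s)\, Q_U(s,\omega,a),
\qquad
Q_U(s,\omega,a) = r(s,a) + \gamma \sum_{s'} P(s' \mid s,a)\, U(\omega,s'),
\]
\[
U(\omega,s') = \bigl(1-\beta_{\omega,\nu}(s')\bigr)\, Q_\Omega(s',\omega) + \beta_{\omega,\nu}(s')\, V_\Omega(s'),
\qquad
r(s,a) = \theta^{\top}\phi(s,a),
\]
where $\pi_{\omega,\vartheta}$ and $\beta_{\omega,\nu}$ are the parametrized intra-option policy and termination function of option $\omega$, and $\phi(s,a)$ is a reward feature vector. A gradient method over such parametrized options supplies the quantities from which a Q-feature (and hence reward feature) space can be derived.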