Recently, adversarial imitation learning has emerged as a scalable reward acquisition approach for inverse reinforcement learning (IRL) problems. However, the estimated reward signals often become uncertain and fail to train a reliable statistical model, since existing methods tend to solve hard optimization problems directly. Inspired by a first-order optimization method called mirror descent, this paper proposes predicting a sequence of reward functions, which are iterative solutions of a constrained convex problem. IRL solutions derived by mirror descent are tolerant of the uncertainty incurred by target density estimation, since the amount of reward learning is regulated with respect to local geometric constraints. We prove that the proposed mirror-descent update rule ensures robust minimization of a Bregman divergence in terms of a rigorous regret bound of $\mathcal{O}(1/T)$ for step sizes $\{\eta_t\}_{t=1}^{T}$. Our IRL method was applied on top of an adversarial framework, and it outperformed existing adversarial methods on an extensive suite of benchmarks.
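For reference, the generic mirror descent iteration underlying this idea can be written as follows; the notation here ($\theta_t$ for the reward parameters, $L$ for the IRL objective, and $\psi$ for the mirror map) is illustrative and not taken from the paper:
\[
\theta_{t+1} = \arg\min_{\theta \in \Theta} \; \eta_t \,\langle \nabla L(\theta_t), \theta \rangle + D_{\psi}(\theta, \theta_t),
\qquad
D_{\psi}(\theta, \theta') = \psi(\theta) - \psi(\theta') - \langle \nabla \psi(\theta'), \theta - \theta' \rangle,
\]
where $D_{\psi}$ is the Bregman divergence induced by a strictly convex mirror map $\psi$, and the step sizes $\{\eta_t\}_{t=1}^{T}$ control how far each reward update may move within the local geometric constraints.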