Online imitation learning (IL) is an algorithmic framework that leverages interactions with expert policies for efficient policy optimization. Here policies are optimized by performing online learning on a sequence of loss functions that encourage the learner to mimic expert actions, and if the online learning is no-regret, the agent can provably learn an expert-like policy. Online IL has demonstrated empirical success in many applications, and, interestingly, the policy improvement speed observed in practice is often much faster than existing theory suggests. In this work, we provide an explanation of this phenomenon. Let $\xi$ denote the policy class bias and assume the online IL loss functions are convex, smooth, and non-negative. We prove that, after $N$ rounds of online IL with stochastic feedback, the policy improves at a rate of $\tilde{O}(1/N + \sqrt{\xi/N})$ in both expectation and high probability. In other words, adopting a sufficiently expressive policy class in online IL has two benefits: the policy improvement speed increases and the performance bias decreases.
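To make the framework concrete, below is a minimal, hypothetical sketch of an online IL loop in the spirit of DAgger: at each round the learner rolls out its current policy, queries the expert on the visited states, and takes one no-regret (online gradient descent) step on a convex, smooth, non-negative imitation loss. The linear policy class, the `rollout_states` stand-in for environment interaction, and the `expert_policy` oracle are illustrative assumptions, not the setup analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def rollout_states(theta, horizon, dim):
    # Stand-in for collecting states by running the current policy in an
    # environment; random features keep the sketch self-contained.
    return rng.normal(size=(horizon, dim))

def expert_policy(s):
    # Hypothetical expert: acts greedily w.r.t. a hidden linear score.
    w_star = np.array([[1.0, -0.5, 0.2, 0.0],
                       [-0.3, 0.8, 0.0, 0.4]])
    return int(np.argmax(w_star @ s))

def online_il(n_rounds=200, horizon=50, lr=0.05, dim=4, n_actions=2):
    # Linear policy class; the policy class bias xi reflects how well this
    # class can represent the expert.
    theta = np.zeros((n_actions, dim))
    for _ in range(n_rounds):
        states = rollout_states(theta, horizon, dim)
        # Round-n loss: mean squared error between the policy's action scores
        # and the expert's one-hot label (convex, smooth, non-negative in theta).
        grad = np.zeros_like(theta)
        for s in states:
            # Expert label on a sampled state (the source of stochastic feedback).
            target = np.eye(n_actions)[expert_policy(s)]
            grad += np.outer(theta @ s - target, s) / len(states)
        theta -= lr * grad  # one online-gradient-descent (no-regret) update
    return theta

theta_hat = online_il()
```

Under the stated assumptions, the sampled per-round gradient is an unbiased estimate of the gradient of that round's imitation loss, which is the stochastic-feedback regime the abstract refers to.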