Imitation learning learns a policy from expert trajectories. While expert data is believed to be crucial for imitation quality, it has been observed that a class of imitation learning methods, adversarial imitation learning (AIL), can achieve exceptional performance. With as little as one expert trajectory, AIL can match expert performance even over long horizons, on tasks such as locomotion control. This phenomenon raises two puzzles. First, why can AIL perform well with only a few expert trajectories? Second, why does AIL maintain good performance regardless of the planning horizon? In this paper, we theoretically explore these two questions. For a total-variation-distance-based AIL (called TV-AIL), our analysis shows a horizon-free imitation gap of $\mathcal{O}(\min\{1, \sqrt{|\mathcal{S}|/N}\})$ on a class of instances abstracted from locomotion control tasks. Here $|\mathcal{S}|$ is the state space size of a tabular Markov decision process, and $N$ is the number of expert trajectories. We emphasize two important features of this bound. First, it is meaningful in both the small-sample and large-sample regimes. Second, it implies that the imitation gap of TV-AIL is at most 1 regardless of the planning horizon. Hence, this bound explains the empirical observations. Technically, we leverage the multi-stage policy optimization structure of TV-AIL and present a new stage-coupled analysis via dynamic programming.
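To make the two regimes of the stated bound explicit, the following sketch spells out the case distinction; the value notation $V(\cdot)$ and expert policy $\pi^{\mathrm{E}}$ are introduced here only for illustration and are not defined in the abstract itself:
\[
  \underbrace{V(\pi^{\mathrm{E}}) - V(\pi)}_{\text{imitation gap}}
  \;\lesssim\; \min\Bigl\{1,\ \sqrt{\tfrac{|\mathcal{S}|}{N}}\Bigr\}
  \;=\;
  \begin{cases}
    1 & \text{if } N < |\mathcal{S}| \quad \text{(small-sample regime: gap bounded by a constant)},\\[4pt]
    \sqrt{|\mathcal{S}|/N} & \text{if } N \ge |\mathcal{S}| \quad \text{(large-sample regime: gap decays at rate } 1/\sqrt{N}\text{)}.
  \end{cases}
\]
In neither case does the planning horizon appear, which is the sense in which the bound is horizon-free.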