We study the following question in the context of imitation learning for continuous control: how are the underlying stability properties of an expert policy reflected in the sample complexity of an imitation learning task? We provide the first results showing that a surprisingly granular connection can be made between the expert system's incremental gain stability, a novel measure of robust convergence between pairs of system trajectories, and the dependence of the resulting generalization bounds on the task horizon $T$. In particular, we propose and analyze incremental-gain-stability-constrained versions of behavior cloning and of a DAgger-like algorithm, and show that the resulting sample-complexity bounds naturally reflect the underlying stability properties of the expert system. As a special case, we delineate a class of systems for which the number of trajectories needed to achieve $\varepsilon$-suboptimality is sublinear in the task horizon $T$, and we do so without requiring (strong) convexity of the loss function in the policy parameters. Finally, we conduct numerical experiments demonstrating the validity of our insights on both a simple nonlinear system, for which the underlying stability properties can be easily tuned, and a high-dimensional quadrupedal robotic simulation.
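For concreteness, the following is a minimal sketch of one way an incremental gain stability condition between pairs of trajectories can be formalized; the comparison functions $\beta$, $\gamma$ and the choice of $\ell_p$ norm here are illustrative assumptions, not the precise definition developed in the body of the paper.

% Sketch (illustrative assumptions): x_t and x'_t denote state
% trajectories of the closed-loop expert system started from initial
% conditions \xi and \xi' and driven by input (perturbation) sequences
% u_t and u'_t; \beta and \gamma are comparison (class-K) functions.
\begin{equation*}
  \Bigg(\sum_{t=0}^{T} \|x_t - x'_t\|^p\Bigg)^{1/p}
  \;\le\;
  \beta\big(\|\xi - \xi'\|\big)
  \;+\;
  \gamma\Bigg(\Big(\sum_{t=0}^{T} \|u_t - u'_t\|^p\Big)^{1/p}\Bigg),
  \qquad \forall\, T \ge 0.
\end{equation*}

Intuitively, a bound of this form ensures that small deviations between the learned policy's inputs and the expert's inputs translate into controlled deviations between the corresponding closed-loop trajectories, which is the mechanism by which the expert's stability properties can enter horizon-dependent generalization bounds.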