Behavior cloning of expert demonstrations can learn optimal policies far more sample-efficiently than reinforcement learning. However, the cloned policy does not extrapolate well to states unseen in the demonstration data, which leads to covariate shift (the agent drifting away from the demonstrations) and compounding errors. In this work, we tackle this issue by widening the region of attraction around the demonstrations, so that the agent learns how to return to the demonstrated trajectories if it veers off course. We train a generative backwards dynamics model and generate short imagined trajectories backward from states in the demonstrations. By imitating both the demonstrations and these model rollouts, the agent learns the demonstrated paths and how to get back onto them. Given optimal or near-optimal demonstrations, the learned policy is both optimal and robust to deviations, with a wider region of attraction. On continuous control domains, we evaluate robustness by starting from initial states unseen in the demonstration data. While both our method and other imitation learning baselines successfully solve the tasks from initial states within the training distribution, our method is considerably more robust to unseen initial states.
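To make the procedure concrete, the following is a minimal PyTorch-style sketch of the augmentation described above. All names here (`BackwardModel`, `imagined_rollouts`, `augmented_bc_step`, the diagonal-Gaussian output head, the weighting `lam`) are illustrative assumptions rather than the paper's actual implementation; the backward model is assumed to have already been fit to reversed demonstration transitions, and `policy` can be any network mapping states to actions.

```python
# Minimal sketch of backward-model data augmentation for behavior cloning.
# Assumptions: the backward model is pre-trained on reversed demo transitions;
# class/function names and the Gaussian head are illustrative, not the paper's code.
import torch
import torch.nn as nn


class BackwardModel(nn.Module):
    """Generative backward dynamics: given a state s_t, sample a plausible
    predecessor state and action (s_{t-1}, a_{t-1})."""
    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(state_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.mean = nn.Linear(hidden, state_dim + action_dim)
        self.log_std = nn.Linear(hidden, state_dim + action_dim)

    def sample(self, s):
        h = self.trunk(s)
        mean = self.mean(h)
        log_std = self.log_std(h).clamp(-5.0, 2.0)
        return mean + torch.randn_like(mean) * log_std.exp()


def imagined_rollouts(model, demo_states, horizon, state_dim):
    """Roll the backward model out for `horizon` steps starting from demonstration
    states, yielding imagined (state, action) pairs that, executed forward,
    lead back onto the demonstrated trajectories."""
    pairs = []
    s = demo_states
    with torch.no_grad():
        for _ in range(horizon):
            prev = model.sample(s)
            s_prev, a_prev = prev[:, :state_dim], prev[:, state_dim:]
            pairs.append((s_prev, a_prev))
            s = s_prev
    return pairs


def bc_loss(policy, states, actions):
    """Plain behavior-cloning regression loss."""
    return ((policy(states) - actions) ** 2).mean()


def augmented_bc_step(policy, model, demo_states, demo_actions,
                      horizon, state_dim, lam=1.0):
    """One training objective: imitate demonstrations and imagined rollouts jointly."""
    loss = bc_loss(policy, demo_states, demo_actions)
    for s_img, a_img in imagined_rollouts(model, demo_states, horizon, state_dim):
        loss = loss + lam * bc_loss(policy, s_img, a_img)
    return loss
```

Cloning on the union of demonstrated and imagined (state, action) pairs is what widens the region of attraction: states slightly off the demonstrations now come with supervised actions that steer the agent back toward them.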