Imitation learning (IL) is a popular paradigm for training policies in robotic systems when specifying the reward function is difficult. However, despite the success of IL algorithms, they impose the somewhat unrealistic requirement that the expert demonstrations must come from the same domain in which the new imitator policy is to be learned. We consider a practical setting, where (i) state-only expert demonstrations from the real (deployment) environment are given to the learner, (ii) the imitation learner policy is trained in a simulation (training) environment whose transition dynamics are slightly different from the real environment, and (iii) the learner does not have any access to the real environment during the training phase beyond the batch of demonstrations given. Most current IL methods, such as generative adversarial imitation learning and its state-only variants, fail to imitate the optimal expert behavior under this setting. By leveraging insights from the robust reinforcement learning (RL) literature and building on recent adversarial imitation approaches, we propose a robust IL algorithm to learn policies that can effectively transfer to the real environment without fine-tuning. Furthermore, we empirically demonstrate on continuous-control benchmarks that our method outperforms the state-of-the-art state-only IL method in terms of zero-shot transfer performance in the real environment and robustness under different testing conditions.
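For context, the state-only adversarial imitation objective referenced above (e.g., a GAIfO-style formulation) is typically posed as a minimax game between the policy and a discriminator over state transitions; the following is a minimal sketch of that standard form, not the specific objective proposed in this work:
\[
\min_{\pi} \max_{D} \; \mathbb{E}_{(s, s') \sim \tau_E}\!\left[\log D(s, s')\right] \;+\; \mathbb{E}_{(s, s') \sim \tau_{\pi}}\!\left[\log\left(1 - D(s, s')\right)\right],
\]
where $\tau_E$ denotes the expert's state-transition distribution collected in the real environment and $\tau_{\pi}$ the imitator's state-transition distribution induced in the simulator. The dynamics mismatch between the two environments is what allows this objective to be minimized without recovering the expert's behavior in the setting considered here.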