Learning to perform tasks by leveraging a dataset of expert observations, also known as imitation learning from observations (ILO), is an important paradigm for learning skills without access to the expert's reward function or actions. We consider ILO in the setting where the expert and learner agents operate in different environments, with the source of the discrepancy being the transition dynamics model. Recent methods for scalable ILO utilize adversarial learning to match the state-transition distributions of the expert and the learner, an approach that becomes challenging when the dynamics are dissimilar. In this work, we propose an algorithm that trains an intermediary policy in the learner environment and uses it as a surrogate expert for the learner. The intermediary policy is trained so that the state transitions it generates are close to the state transitions in the expert dataset. To derive a practical and scalable algorithm, we employ concepts from prior work on estimating the support of a probability distribution. Experiments on MuJoCo locomotion tasks show that our method compares favorably to baselines for ILO under transition dynamics mismatch.
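To make the support-estimation idea concrete, below is a minimal sketch of one common way to estimate the support of the expert's state-transition distribution: random network distillation in the style of RED (Wang et al., 2019). This is an illustrative assumption, not the paper's exact method; the class name `SupportEstimator`, the network sizes, and the reward scale `sigma` are all hypothetical. A trainable predictor is regressed onto a fixed random target on expert transitions (s, s'); the prediction error stays small inside the support of the expert data and grows outside it, yielding a reward that could drive RL training of the intermediary policy.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256):
    # Small fully connected network used for both target and predictor.
    return nn.Sequential(
        nn.Linear(in_dim, hidden), nn.ReLU(),
        nn.Linear(hidden, out_dim),
    )

class SupportEstimator:
    """Hypothetical RED-style support estimator over transitions (s, s')."""

    def __init__(self, state_dim, feat_dim=64, sigma=1.0, lr=1e-3):
        in_dim = 2 * state_dim                 # input is a concatenated (s, s') pair
        self.target = mlp(in_dim, feat_dim)    # fixed, randomly initialized network
        self.predictor = mlp(in_dim, feat_dim) # trained to imitate the target
        for p in self.target.parameters():
            p.requires_grad_(False)
        self.sigma = sigma                     # assumed reward-scaling hyperparameter
        self.opt = torch.optim.Adam(self.predictor.parameters(), lr=lr)

    def fit(self, expert_transitions, epochs=100):
        # expert_transitions: tensor of shape (N, 2 * state_dim), expert (s, s') pairs.
        for _ in range(epochs):
            err = (self.predictor(expert_transitions)
                   - self.target(expert_transitions)).pow(2).mean()
            self.opt.zero_grad()
            err.backward()
            self.opt.step()

    def reward(self, s, s_next):
        # High reward when (s, s') falls inside the support of the expert data,
        # low reward otherwise; usable as an RL reward for the intermediary policy.
        x = torch.cat([s, s_next], dim=-1)
        with torch.no_grad():
            err = (self.predictor(x) - self.target(x)).pow(2).sum(-1)
        return torch.exp(-self.sigma * err)
```

In this sketch, the intermediary policy would then be trained with any standard RL algorithm in the learner environment using `reward(s, s_next)`, encouraging it to visit state transitions that lie in the support of the expert dataset despite the dynamics mismatch.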