Learning Feasibility to Imitate Demonstrators with Different Dynamics

The goal of learning from demonstrations is to learn a policy for an agent (the imitator) by mimicking the behavior in the demonstrations. Prior work on learning from demonstrations assumes that the demonstrations are collected by a demonstrator with the same dynamics as the imitator. In many real-world applications, however, this assumption is limiting: to alleviate the scarcity of data in robotics, we would like to leverage demonstrations collected from agents with different dynamics. This can be challenging, as the demonstrations might not even be feasible for the imitator. Our insight is that we can learn a feasibility metric that captures the likelihood that a demonstration is feasible for the imitator. We develop a feasibility MDP (f-MDP) and derive the feasibility score by learning an optimal policy in the f-MDP. Our proposed feasibility measure encourages the imitator to learn from more informative demonstrations and to disregard demonstrations that are far from feasible. Our experiments in four simulated environments and on a real robot show that the policy learned with our approach achieves a higher expected return than prior work. Videos of the real robot arm experiments are available on our website (https://sites.google.com/view/learning-feasibility).
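
As a rough illustration of how a feasibility score might be used, the sketch below turns per-demonstration scores into weights that drop or down-weight demonstrations in an imitation loss. It is not the f-MDP procedure, whose details are not given here. The score values, the `threshold` cutoff, and both function names are assumptions made only for this example.

```python
def feasibility_weights(scores, threshold=0.2):
    """Turn per-demonstration feasibility scores in [0, 1] into weights.

    Demonstrations scoring below `threshold` (a hypothetical cutoff) get
    weight 0; the remaining scores are normalized to sum to 1.
    """
    kept = [s if s >= threshold else 0.0 for s in scores]
    total = sum(kept)
    if total == 0.0:
        raise ValueError("no demonstration passes the feasibility threshold")
    return [k / total for k in kept]


def weighted_imitation_loss(per_demo_losses, weights):
    """Feasibility-weighted sum of per-demonstration imitation losses."""
    return sum(w * loss for w, loss in zip(weights, per_demo_losses))


# Hypothetical scores for four demonstrations; the last two are treated
# as far from feasible and therefore ignored.
scores = [0.9, 0.6, 0.05, 0.0]
weights = feasibility_weights(scores)
loss = weighted_imitation_loss([1.0, 2.0, 10.0, 50.0], weights)
```

With this weighting, the large losses on the two low-scoring demonstrations do not contribute to the objective, so the imitator is driven only by demonstrations it is more likely to be able to reproduce.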
