Imitation learning from human-provided demonstrations is a strong approach for learning policies for robot manipulation. While the ideal dataset for imitation learning is homogeneous and low-variance, reflecting a single, optimal method for performing a task, natural human behavior is highly heterogeneous, with several optimal ways to demonstrate a task. This multimodality is inconsequential to human users, with task variations manifesting as subconscious choices; for example, reaching down, then across to grasp an object, versus reaching across, then down. Yet this mismatch presents a problem for interactive imitation learning, where sequences of users improve on a policy by iteratively collecting new, possibly conflicting demonstrations. To combat this problem of demonstrator incompatibility, this work designs an approach for 1) measuring the compatibility of a new demonstration given a base policy, and 2) actively eliciting more compatible demonstrations from new users. Across two simulation tasks requiring long-horizon, dexterous manipulation and a real-world "food plating" task with a Franka Emika Panda arm, we show that we can both identify incompatible demonstrations via post-hoc filtering and apply our compatibility measure to actively elicit compatible demonstrations from new users, leading to improved task success rates across simulated and real environments.