In the learning from demonstration (LfD) paradigm, understanding and evaluating the demonstrated behaviors plays a critical role in extracting control policies for robots. Without this knowledge, a robot may infer incorrect reward functions that lead to undesirable or unsafe control policies. To address the challenge of reward shaping, recent work has proposed an LfD framework in which a user provides a set of formal task specifications to guide LfD. However, in this framework, the specifications are manually ordered in a performance graph (a partial order that specifies the relative importance of the specifications). The main contribution of this paper is an algorithm that learns the performance graph directly from user-provided demonstrations, together with a demonstration that reward functions generated from the learned performance graph yield policies similar to those obtained from manually specified performance graphs. We conduct a user study showing that the priorities users assign to behaviors in a simulated highway driving domain match the automatically inferred performance graph. This establishes that we can accurately evaluate user demonstrations with respect to task specifications without relying on expert-defined criteria.