When personal, assistive, and interactive robots make mistakes, humans naturally and intuitively correct those mistakes through physical interaction. In simple situations, one correction is sufficient to convey what the human wants. But when humans are working with multiple robots, or when the robot is performing an intricate task, the human often must make several corrections to fix the robot's behavior. Prior research assumes that each of these physical corrections is an independent event, and learns from them one at a time. However, this misses out on crucial information: these interactions are interconnected, and may only make sense when viewed together. Alternatively, other work reasons over the final trajectory produced by all of the human's corrections. But this method must wait until the end of the task to learn from corrections, rather than inferring from the corrections in an online fashion. In this paper we formalize an approach for learning from sequences of physical corrections during the current task. To do this we introduce an auxiliary reward that captures the human's trade-off between making corrections that improve the robot's immediate reward and corrections that improve its long-term performance. We evaluate the resulting algorithm in remote and in-person human-robot experiments, and compare it to both independent and final baselines. Our results indicate that users are best able to convey their objective when the robot reasons over their sequence of corrections.
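To make the stated trade-off concrete, one illustrative form for such an auxiliary reward is a weighted combination of the immediate reward a correction yields and the expected long-term return it sets up. This is a minimal sketch introduced here for intuition, not necessarily the paper's exact formulation; the symbols $\alpha$, $r$, $s_t$, and $a_t$ are assumptions:
$$
R_{\text{aux}}(s_t, a_t) \;=\; \alpha \, r(s_t, a_t) \;+\; (1-\alpha)\, \mathbb{E}\!\left[\sum_{k>t} \gamma^{\,k-t}\, r(s_k, a_k)\right],
$$
where $a_t$ denotes the human's physical correction at time $t$ and $\alpha \in [0,1]$ captures how strongly the human prioritizes fixing the robot's current behavior over shaping its future performance.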