Learning Reward Functions from Diverse Sources of Human Feedback: Optimally Integrating Demonstrations and Preferences

Reward functions are a common way to specify the objective of a robot. As designing reward functions can be extremely challenging, a more promising approach is to directly learn reward functions from human teachers. Importantly, data from human teachers can be collected either passively or actively in a variety of forms: passive data sources include demonstrations (e.g., kinesthetic guidance), whereas preferences (e.g., comparative rankings) are actively elicited. Prior research has independently applied reward learning to these different data sources. However, there exist many domains where multiple sources are complementary and expressive. Motivated by this general problem, we present a framework to integrate multiple sources of information, which are either passively or actively collected from human users. In particular, we present an algorithm that first utilizes user demonstrations to initialize a belief about the reward function, and then actively probes the user with preference queries to zero in on their true reward. This algorithm not only enables us to combine multiple data sources, but it also informs the robot when it should leverage each type of information. Further, our approach accounts for the human's ability to provide data, yielding user-friendly preference queries that are also theoretically optimal. Our extensive simulated experiments and user studies on a Fetch mobile manipulator demonstrate the superiority and the usability of our integrated framework.
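The two-stage idea described above (demonstrations initialize a belief over the reward, then preference queries refine it) can be illustrated with a minimal sketch. The abstract does not specify any modeling details, so everything here is an assumption made for illustration: a reward linear in trajectory features, a Boltzmann likelihood for demonstrations over a finite candidate set, a logistic model for preference answers, query selection by information gain, and a simulated noiseless user. All function names are hypothetical and do not come from the authors' code.

```python
import numpy as np

def pref_prob(W, phi_a, phi_b):
    """P(user prefers A over B | w) for each sampled w (assumed logistic model)."""
    return 1.0 / (1.0 + np.exp(-(W @ (phi_a - phi_b))))

def belief_from_demos(W, demo_feats, cand_feats, beta=1.0):
    """Weight each sampled w by an assumed Boltzmann likelihood of the demos."""
    scores = beta * (W @ cand_feats.T)                  # (samples, candidates)
    m = scores.max(axis=1, keepdims=True)               # for a stable log-sum-exp
    log_z = m[:, 0] + np.log(np.exp(scores - m).sum(axis=1))
    log_p = sum(beta * (W @ d) - log_z for d in demo_feats)
    p = np.exp(log_p - log_p.max())
    return p / p.sum()

def binary_entropy(q):
    q = np.clip(q, 1e-12, 1 - 1e-12)
    return -(q * np.log2(q) + (1 - q) * np.log2(1 - q))

def info_gain(W, p, phi_a, phi_b):
    """Mutual information between the user's answer and w under belief p."""
    pa = pref_prob(W, phi_a, phi_b)
    return binary_entropy(p @ pa) - p @ binary_entropy(pa)

def learn_reward(true_w, n_queries=30, n_samples=2000, n_cands=50, seed=0):
    rng = np.random.default_rng(seed)
    d = len(true_w)
    # Sample-based belief: candidate reward weights on the unit sphere.
    W = rng.normal(size=(n_samples, d))
    W /= np.linalg.norm(W, axis=1, keepdims=True)
    # Hypothetical trajectory features; a real system would compute these
    # from robot trajectories.
    cands = rng.normal(size=(n_cands, d))

    # Stage 1 (passive): one simulated demonstration, the best candidate.
    demo = cands[np.argmax(cands @ true_w)]
    p = belief_from_demos(W, [demo], cands)

    # Stage 2 (active): ask the most informative pairwise query, then update.
    for _ in range(n_queries):
        pairs = rng.integers(0, n_cands, size=(300, 2))
        pairs = pairs[pairs[:, 0] != pairs[:, 1]]
        gains = [info_gain(W, p, cands[i], cands[j]) for i, j in pairs]
        i, j = pairs[int(np.argmax(gains))]
        pa = pref_prob(W, cands[i], cands[j])
        prefers_a = (cands[i] - cands[j]) @ true_w > 0   # simulated noiseless user
        p = p * (pa if prefers_a else 1.0 - pa)
        p /= p.sum()

    mean_w = p @ W
    return mean_w / np.linalg.norm(mean_w), p

if __name__ == "__main__":
    true_w = np.array([0.6, -0.3, 0.74])
    true_w /= np.linalg.norm(true_w)
    est, _ = learn_reward(true_w)
    print("alignment (cosine similarity):", float(est @ true_w))
```

The printed alignment is the cosine similarity between the estimated and the simulated true weights. Swapping the query-selection criterion or the user model changes only `info_gain` and `pref_prob`, which is why the sketch keeps them as separate functions.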
