Assistive robots have the potential to help people perform everyday tasks. However, these robots first need to learn what their user wants them to do. Teaching assistive robots is hard for inexperienced users, elderly users, and users living with physical disabilities, since these individuals are often unable to show the robot their desired behavior. We know that inclusive learners should give human teachers credit for what they cannot demonstrate. But today's robots do the opposite: they assume every user is capable of providing any demonstration. As a result, these robots learn to mimic the demonstrated behavior, even when that behavior is not what the human really meant! Here we propose a different approach to reward learning: robots that reason about the user's demonstrations in the context of similar or simpler alternatives. Unlike prior work, which errs towards overestimating the human's capabilities, here we err towards underestimating what the human can input (i.e., their choice set). Our theoretical analysis proves that underestimating the human's choice set is risk-averse, with better worst-case performance than overestimating. We formalize three properties to generate similar and simpler alternatives. Across simulations and a user study, our resulting algorithm better extrapolates the human's objective. See the user study here: https://youtu.be/RgbH2YULVRo
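To make the choice-set idea concrete, below is a minimal sketch of a demonstration likelihood evaluated against an (under)estimated choice set. It assumes a Boltzmann-rational human model with linear reward features, a standard setup in reward learning; the function name demo_likelihood, the toy feature vectors, and the rationality coefficient beta are illustrative assumptions, not the paper's exact formulation.

    import numpy as np

    def demo_likelihood(demo, choice_set, theta, beta=1.0):
        # Boltzmann-rational likelihood P(demo | theta, choice_set):
        # the demonstration competes only against the alternatives in the
        # (under)estimated choice set, not against every feasible trajectory,
        # so the human gets credit for behaviors they cannot demonstrate.
        rewards = beta * np.array([theta @ xi for xi in choice_set])
        rewards -= rewards.max()                      # numerical stability
        probs = np.exp(rewards) / np.exp(rewards).sum()
        idx = next(i for i, xi in enumerate(choice_set)
                   if np.allclose(xi, demo))          # demo is in the set
        return probs[idx]

    # Toy usage (hypothetical numbers): two reward hypotheses, one demo,
    # and a small choice set of similar / simpler alternatives, all given
    # as feature vectors.
    demo = np.array([1.0, 0.2])
    choice_set = [demo, np.array([0.8, 0.1]), np.array([0.5, 0.5])]
    for theta in [np.array([1.0, 0.0]), np.array([0.0, 1.0])]:
        print(theta, demo_likelihood(demo, choice_set, theta))

Under this sketch, underestimating the human means populating choice_set with only the similar or simpler alternatives the user could plausibly have provided; a Bayesian learner then scores each reward hypothesis theta with this likelihood instead of assuming the demonstration beats every possible trajectory.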