When inferring reward functions from human behavior (be it demonstrations, comparisons, physical corrections, or e-stops), it has proven useful to model the human as making noisy-rational choices, with a "rationality coefficient" capturing how much noise or entropy we expect to see in the human behavior. Prior work typically sets the rationality level to a constant value, regardless of the type or quality of human feedback. However, in many settings, giving one type of feedback (e.g. a demonstration) may be much more difficult than giving a different type of feedback (e.g. answering a comparison query). Thus, we expect to see more or less noise depending on the type of human feedback. In this work, we advocate that grounding the rationality coefficient in real data for each feedback type, rather than assuming a default value, has a significant positive effect on reward learning. We test this both in simulated experiments and in a user study with real human feedback. We find that overestimating human rationality can have dire effects on reward-learning accuracy and regret. We also find that fitting the rationality coefficient to human data enables better reward learning, even when the human deviates significantly from the noisy-rational choice model due to systematic biases. Further, we find that the rationality level affects the informativeness of each feedback type: surprisingly, demonstrations are not always the most informative -- when the human acts very suboptimally, comparisons actually become more informative, even when the rationality level is the same for both. Ultimately, our results emphasize the importance and advantage of paying attention to the assumed human-rationality level, especially when agents actively learn from multiple types of human feedback.
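To make the noisy-rational model concrete, the sketch below (not the authors' implementation; all variable names and the grid-search ranges are illustrative assumptions) shows the Boltzmann-rational choice likelihood P(c) ∝ exp(β · return(c)) and a simple maximum-likelihood fit of the rationality coefficient β from observed choices of a single feedback type:

```python
# Minimal sketch of the Boltzmann noisy-rational choice model and of fitting the
# rationality coefficient beta to observed human feedback (illustrative only).
import numpy as np

def boltzmann_log_likelihood(beta, returns, choices):
    """Log-likelihood of observed choices under P(c) proportional to exp(beta * return(c)).

    returns: (n_queries, n_options) array of each option's return under a candidate reward.
    choices: (n_queries,) indices of the option the human actually picked.
    """
    logits = beta * returns  # higher beta = more deterministic (rational) choices
    log_probs = logits - np.logaddexp.reduce(logits, axis=1, keepdims=True)
    return log_probs[np.arange(len(choices)), choices].sum()

def fit_beta(returns, choices, beta_grid=np.linspace(0.0, 10.0, 201)):
    """Grid-search maximum-likelihood estimate of beta for one feedback type."""
    lls = [boltzmann_log_likelihood(b, returns, choices) for b in beta_grid]
    return beta_grid[int(np.argmax(lls))]

# Example: pairwise comparisons where the simulated human picks the higher-return
# option only ~70% of the time, so the fitted beta should be moderate rather than large.
rng = np.random.default_rng(0)
returns = rng.normal(size=(200, 2))
noisy_choices = np.where(rng.random(200) < 0.7,
                         returns.argmax(axis=1), returns.argmin(axis=1))
print("fitted beta:", fit_beta(returns, noisy_choices))
```

Fitting a separate β per feedback type in this way (rather than assuming one default value for all of them) is the grounding step the abstract argues for.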