When inferring reward functions from human behavior (be it demonstrations, comparisons, physical corrections, or e-stops), it has proven useful to model the human as making noisy-rational choices, with a "rationality coefficient" capturing how much noise or entropy we expect to see in the human behavior. Many existing works have opted to fix this coefficient regardless of the type or quality of human feedback. However, in some settings, giving a demonstration may be much more difficult than answering a comparison query. In this case, we should expect to see more noise or suboptimality in demonstrations than in comparisons, and should interpret the feedback accordingly. In this work, we advocate that grounding the rationality coefficient in real data for each feedback type, rather than assuming a default value, has a significant positive effect on reward learning. We test this in experiments with both simulated feedback and a user study. We find that when learning from a single feedback type, overestimating human rationality can have dire effects on reward accuracy and regret. Further, we find that the rationality level affects the informativeness of each feedback type: surprisingly, demonstrations are not always the most informative -- when the human acts very suboptimally, comparisons actually become more informative, even when the rationality level is the same for both. Moreover, when the robot gets to decide which feedback type to ask for, it gets a large advantage from accurately modeling the rationality level of each type. Ultimately, our results emphasize the importance of paying attention to the assumed rationality level, not only when learning from a single feedback type, but especially when agents actively learn from multiple feedback types.
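To make the model concrete: under the standard Boltzmann (noisy-rational) choice model, the probability that the human picks option a over option b in a comparison query is P(a) = exp(beta R(a)) / (exp(beta R(a)) + exp(beta R(b))), where beta is the rationality coefficient. The sketch below is an illustrative Python example of "grounding beta in data", not the paper's implementation: the function names are ours, and it assumes trajectory rewards are known, whereas in practice beta would be estimated jointly with the reward parameters (and analogously for demonstrations, where the Boltzmann model is over whole trajectories rather than pairs).

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import log_expit

def comparison_log_likelihood(beta, rewards_a, rewards_b, choices):
    """Log-likelihood of pairwise comparisons under a Boltzmann-rational
    human: P(choose a over b) = sigmoid(beta * (R(a) - R(b)))."""
    logits = beta * (np.asarray(rewards_a) - np.asarray(rewards_b))
    choices = np.asarray(choices)  # 1.0 if option a was chosen, else 0.0
    return np.sum(choices * log_expit(logits)
                  + (1 - choices) * log_expit(-logits))

def fit_rationality(rewards_a, rewards_b, choices):
    """Ground the rationality coefficient in observed feedback by maximum
    likelihood, rather than assuming a default value."""
    result = minimize_scalar(
        lambda b: -comparison_log_likelihood(b, rewards_a, rewards_b, choices),
        bounds=(1e-3, 100.0),
        method="bounded",
    )
    return result.x

# Usage: recover the rationality level of a fairly noisy simulated human.
rng = np.random.default_rng(0)
r_a, r_b = rng.normal(size=500), rng.normal(size=500)
true_beta = 0.5
obs = (rng.random(500) < 1 / (1 + np.exp(-true_beta * (r_a - r_b)))).astype(float)
print(f"estimated beta: {fit_rationality(r_a, r_b, obs):.2f}")
```

A small beta fitted this way signals a noisy feedback channel whose data should be weighted accordingly, while a large beta signals near-optimal feedback; fitting it per feedback type is what lets the learner compare, say, demonstrations against comparisons on equal footing.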