When a person is not satisfied with how a robot performs a task, they can intervene to correct it. Reward learning methods enable the robot to adapt its reward function online based on such human input, but they rely on handcrafted features. When the correction cannot be explained by these features, recent work in deep Inverse Reinforcement Learning (IRL) suggests that the robot could ask for task demonstrations and recover a reward defined over the raw state space. Our insight is that rather than implicitly learning about the missing feature(s) from demonstrations, the robot should instead ask for data that explicitly teaches it about what it is missing. We introduce a new type of human input in which the person guides the robot from states where the feature being taught is highly expressed to states where it is not. We propose an algorithm for learning the feature from the raw state space and integrating it into the reward function. By focusing the human input on the missing feature, our method decreases sample complexity and improves generalization of the learned reward over the above deep IRL baseline. We show this in experiments with a physical 7DOF robot manipulator, as well as in a user study conducted in a simulated environment.
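The core idea, learning a missing feature from human-guided state sequences that run from high to low feature expression, can be sketched minimally as follows. The 2D toy states, linear feature model, and pairwise ranking loss are illustrative assumptions for this sketch, not the paper's actual architecture or training objective.

```python
import numpy as np

rng = np.random.default_rng(0)

# A "feature trace": an ordered sequence of states guided by the person,
# starting where the (hidden) missing feature is highly expressed and
# ending where it is not. Here the hidden feature is proximity to (2, 2).
n = 20
trace = np.linspace([2.0, 2.0], [0.0, 0.0], n) + 0.02 * rng.normal(size=(n, 2))

# Illustrative linear feature model f(s) = w . s, trained so that f
# decreases monotonically along the trace, using a pairwise logistic
# ranking loss on consecutive states.
w = np.zeros(2)
lr = 0.1
for _ in range(500):
    f = trace @ w
    gw = np.zeros_like(w)
    for i in range(n - 1):
        # We want f[i] > f[i + 1]; penalize small or negative margins.
        margin = f[i] - f[i + 1]
        g = -1.0 / (1.0 + np.exp(margin))        # d(logistic loss)/d(margin)
        gw += g * (trace[i] - trace[i + 1])
    w -= lr * gw / (n - 1)

f = trace @ w
# The learned feature should now decrease along the human-provided trace,
# so it can be appended to the reward's feature vector and reweighted.
print(f[0] > f[-1])
```

Once such a feature is learned, integrating it into the reward amounts to appending it to the existing handcrafted feature vector and re-estimating the reward weights from the person's corrections, which is what lets the human input stay focused on only the missing feature.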