Reward learning enables robots to learn adaptable behaviors from human input. Traditional methods model the reward as a linear function of hand-crafted features, but this requires specifying all the relevant features a priori, which is impossible for real-world tasks. To get around this issue, recent deep Inverse Reinforcement Learning (IRL) methods learn rewards directly from the raw state, but this is challenging because the robot has to simultaneously learn both which features are important and how to combine them. Instead, we propose a divide-and-conquer approach: focus human input specifically on learning the features separately, and only then learn how to combine them into a reward. We introduce a novel type of human input for teaching features and an algorithm that utilizes it to learn complex features from the raw state space. The robot can then learn how to combine them into a reward using demonstrations, corrections, or other reward learning frameworks. We demonstrate our method in settings where all features have to be learned from scratch, as well as where some of the features are known. By first focusing human input specifically on the feature(s), our method decreases sample complexity and improves generalization of the learned reward over a deep IRL baseline. We show this in experiments with a physical 7-DoF robot manipulator, as well as in a user study conducted in a simulated environment.
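For concreteness, the linear reward structure the abstract refers to can be sketched as follows; the notation here is illustrative and not necessarily the paper's own:
\[
R(\xi) \;=\; \theta^\top \phi(\xi) \;=\; \sum_{i} \theta_i \, \phi_i(\xi),
\]
where \(\phi(\xi)\) is a vector of feature values for a trajectory \(\xi\) and \(\theta\) are the weights combining them. Traditional methods hand-craft the \(\phi_i\) and learn only \(\theta\); deep IRL learns both implicitly from raw state; the divide-and-conquer approach described above first learns the \(\phi_i\) from targeted human input and only then learns \(\theta\) from demonstrations, corrections, or other reward learning frameworks.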