When robots enter everyday human environments, they need to understand their tasks and how they should perform them. To encode this knowledge, reward functions, which specify a robot's objective, are employed. However, designing reward functions can be extremely challenging for complex tasks and environments. A promising approach is to learn reward functions from humans. Recently, several robot learning works have embraced this approach and leveraged human demonstrations to learn reward functions. Known as inverse reinforcement learning, this approach relies on a fundamental assumption: humans can provide near-optimal demonstrations to the robot. Unfortunately, this is rarely the case: human demonstrations to the robot are often suboptimal for a variety of reasons, e.g., the difficulty of teleoperation, the robot's high number of degrees of freedom, or humans' cognitive limitations. This thesis is an attempt to learn reward functions from human users by using other, more reliable data modalities. Specifically, we study how reward functions can be learned from comparative feedback, in which the human user compares multiple robot trajectories instead of (or in addition to) providing demonstrations. To this end, we first propose various forms of comparative feedback, e.g., pairwise comparisons, best-of-many choices, rankings, and scaled comparisons, and describe how a robot can use these forms of human feedback to infer a reward function, which may be parametric or non-parametric. Next, we propose active learning techniques that enable the robot to ask comparison queries that maximize the expected information gained from the user's response. Finally, we demonstrate the applicability of our methods in a wide variety of domains, ranging from autonomous driving simulations to home robotics, and from standard reinforcement learning benchmarks to lower-body exoskeletons.
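To make the two core ideas of the abstract concrete, the sketch below illustrates (i) learning a reward from pairwise comparisons and (ii) selecting the next comparison query by expected information gain. This is not the thesis' implementation: the linear reward model, the Bradley-Terry choice likelihood, the posterior-sample approximation, and all names (fit_reward, select_query, w_true, etc.) are assumptions chosen only for illustration.

```python
# Minimal sketch, assuming a linear reward r(traj) = w . phi(traj) and a
# Bradley-Terry preference model: P(A preferred over B) = sigmoid(w . (phi(A) - phi(B))).
# All data below are synthetic placeholders.
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def fit_reward(feature_diffs, labels, lr=0.1, steps=2000, reg=1e-3):
    """Gradient ascent on the Bradley-Terry log-likelihood.

    feature_diffs: (N, d) array of phi(traj_A) - phi(traj_B) for each query.
    labels: (N,) array, 1 if the user preferred A, 0 if they preferred B.
    """
    w = np.zeros(feature_diffs.shape[1])
    for _ in range(steps):
        p = sigmoid(feature_diffs @ w)
        grad = feature_diffs.T @ (labels - p) - reg * w
        w += lr * grad / len(labels)
    return w

def select_query(candidate_diffs, w_samples):
    """Pick the comparison query whose answer is expected to be most informative
    about w, i.e. maximal mutual information approximated with posterior samples."""
    probs = sigmoid(candidate_diffs @ w_samples.T)            # (Q, M) per-sample answer probs
    p_mean = probs.mean(axis=1)                                # marginal P(user picks A)
    ent = lambda p: -(p * np.log(p + 1e-12) + (1 - p) * np.log(1 - p + 1e-12))
    info_gain = ent(p_mean) - ent(probs).mean(axis=1)          # H(answer) - E_w[H(answer | w)]
    return int(np.argmax(info_gain))

# Toy data: a hidden "true" reward generates noisy preferences over random queries.
d, n_queries = 4, 200
w_true = rng.normal(size=d)
diffs = rng.normal(size=(n_queries, d))
labels = (rng.uniform(size=n_queries) < sigmoid(diffs @ w_true)).astype(float)

w_hat = fit_reward(diffs, labels)
print("cosine similarity to true reward:",
      w_hat @ w_true / (np.linalg.norm(w_hat) * np.linalg.norm(w_true)))

# Active querying: crude posterior samples around the point estimate, then pick
# the most informative of 50 candidate comparison queries.
w_samples = w_hat + 0.1 * rng.normal(size=(100, d))
candidates = rng.normal(size=(50, d))
print("next query index:", select_query(candidates, w_samples))
```

In practice the forms of feedback (rankings, best-of-many, scaled comparisons) and the parametric or non-parametric reward models studied in the thesis go beyond this pairwise, linear setting; the sketch only fixes intuition for the learning and active-querying loop.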