Generating complex behaviors that satisfy the preferences of non-expert users is a crucial requirement for AI agents. Interactive reward learning from trajectory comparisons is one way to let non-expert users convey complex objectives by expressing preferences over short clips of agent behavior. Although this parametric method can encode the complex tacit knowledge present in the underlying tasks, it implicitly assumes that the human cannot provide feedback richer than binary preference labels, which leads to intolerably high feedback complexity and a poor user experience. Providing a detailed, symbolic, closed-form specification of the objectives might be tempting, but it is not always feasible, even for an expert user. In most cases, however, humans know how the agent should change its behavior along meaningful axes to fulfill their underlying purpose, even if they cannot fully specify the task objectives symbolically. Using this as motivation, we introduce the notion of Relative Behavioral Attributes, which allows users to tweak the agent's behavior through symbolic concepts (e.g., increasing the softness or speed of the agent's movement). We propose two practical methods that can learn to model any kind of behavioral attribute from ordered behavior clips. We demonstrate the effectiveness of our methods on four tasks with nine different behavioral attributes, showing that once the attributes are learned, end users can produce desirable agent behaviors with relatively little effort, by providing feedback only around ten times. This is over an order of magnitude less than that required by the popular learning-from-human-preferences baselines. The supplementary video and source code are available at: https://guansuns.github.io/pages/rba.
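To make the idea of learning an attribute from ordered behavior clips more concrete, the following is a minimal illustrative sketch and not the paper's actual method. It assumes each clip has already been summarized as a small feature vector (here a hypothetical `[mean_speed, mean_height]`), and it fits a linear scorer with a standard pairwise logistic ranking loss so that clips showing more of the attribute receive higher scores. All names (`score`, `train_attribute_ranker`) and the synthetic data are assumptions made for illustration.

```python
import math
import random


def score(weights, features):
    """Linear attribute score for one clip's feature vector."""
    return sum(w * x for w, x in zip(weights, features))


def train_attribute_ranker(ordered_pairs, dim, lr=0.1, epochs=200):
    """Fit weights so that score(hi) > score(lo) for every ordered pair.

    ordered_pairs: list of (lo_features, hi_features), where the second
    clip is assumed to exhibit more of the attribute than the first.
    """
    weights = [0.0] * dim
    for _ in range(epochs):
        for lo, hi in ordered_pairs:
            margin = score(weights, hi) - score(weights, lo)
            # Loss is -log(sigmoid(margin)); its descent step scales with
            # 1 - sigmoid(margin). The clamp avoids overflow in exp().
            grad_scale = 1.0 / (1.0 + math.exp(min(margin, 50.0)))
            for i in range(dim):
                weights[i] += lr * grad_scale * (hi[i] - lo[i])
    return weights


# Synthetic, hypothetical data: the "speed" attribute is feature 0.
rng = random.Random(0)
pairs = []
for _ in range(50):
    a = [rng.random(), rng.random()]
    b = [rng.random(), rng.random()]
    lo, hi = (a, b) if a[0] < b[0] else (b, a)
    pairs.append((lo, hi))

weights = train_attribute_ranker(pairs, dim=2)

slow_clip = [0.1, 0.5]
fast_clip = [0.9, 0.5]
print(score(weights, fast_clip) > score(weights, slow_clip))
```

With a scorer of this kind, a request such as "move faster" could in principle be mapped to a target score higher than that of the current behavior, which is the sort of relative feedback the abstract describes.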