Relative Behavioral Attributes: Filling the Gap between Symbolic Goal Specification and Reward Learning from Human Preferences

Generating complex behaviors from goals specified by non-expert users is a crucial capability for intelligent agents. Interactive reward learning from trajectory comparisons is one way to let non-expert users convey complex objectives by expressing preferences over short clips of agent behavior. Although this method can encode complex tacit knowledge present in the underlying tasks, it implicitly assumes that the human cannot provide richer feedback than binary preference labels, which leads to extremely high feedback complexity and a poor user experience. While providing a detailed symbolic specification of the objectives might be tempting, it is not always feasible, even for an expert user. However, in most cases, humans know how the agent should change its behavior along meaningful axes to fulfill the underlying purpose, even if they cannot fully specify the task objectives symbolically. Motivated by this, we introduce the notion of Relative Behavioral Attributes, which act as a middle ground between exact goal specification and reward learning purely from preference labels by enabling users to tweak the agent's behavior through nameable concepts (e.g., increasing the softness of the movement of a two-legged "sneaky" agent). We propose two different parametric methods that can potentially encode any kind of behavioral attribute from ordered behavior clips. We demonstrate the effectiveness of our methods on 4 tasks with 9 different behavioral attributes and show that, once the attributes are learned, end users can produce desirable agent behaviors with little effort by providing feedback only around 10 times. The feedback complexity of our approach is more than 10 times lower than that of the learning-from-human-preferences baseline, which demonstrates that our approach is readily applicable to real-world applications.
