Relative Behavioral Attributes: Filling the Gap between Symbolic Goal Specification and Reward Learning from Human Preferences

Generating complex behaviors that satisfy the preferences of non-expert users is a crucial requirement for AI agents. Interactive reward learning from trajectory comparisons is one way to let non-expert users convey complex objectives by expressing preferences over short clips of agent behavior. Even though this parametric method can encode complex tacit knowledge present in the underlying tasks, it implicitly assumes that the human is unable to provide richer feedback than binary preference labels, leading to intolerably high feedback complexity and poor user experience. While providing a detailed symbolic closed-form specification of the objectives might be tempting, it is not always feasible even for an expert user. In most cases, however, humans know how the agent should change its behavior along meaningful axes to fulfill their underlying purpose, even if they cannot fully specify the task objectives symbolically. With this as motivation, we introduce the notion of Relative Behavioral Attributes, which allows users to tweak the agent's behavior through symbolic concepts (e.g., increasing the softness or speed of the agent's movement). We propose two practical methods that can learn to model any kind of behavioral attribute from ordered behavior clips. We demonstrate the effectiveness of our methods on four tasks with nine different behavioral attributes, showing that once the attributes are learned, end users can produce desirable agent behaviors with relatively little effort, providing feedback only around ten times. This is over an order of magnitude less feedback than that required by popular learning-from-human-preferences baselines.
