Generating complex behaviors that satisfy the preferences of non-expert users is a crucial requirement for AI agents. Interactive reward learning from trajectory comparisons (a.k.a. RLHF) is one way to allow non-expert users to convey complex objectives by expressing preferences over short clips of agent behaviors. Even though this parametric method can encode complex tacit knowledge present in the underlying tasks, it implicitly assumes that the human is unable to provide richer feedback than binary preference labels, leading to intolerably high feedback complexity and poor user experience. While providing a detailed symbolic closed-form specification of the objectives might be tempting, it is not always feasible even for an expert user. However, in most cases, humans are aware of how the agent should change its behavior along meaningful axes to fulfill their underlying purpose, even if they are not able to fully specify task objectives symbolically. Using this as motivation, we introduce the notion of Relative Behavioral Attributes, which allows the users to tweak the agent behavior through symbolic concepts (e.g., increasing the softness or speed of agents' movement). We propose two practical methods that can learn to model any kind of behavioral attributes from ordered behavior clips. We demonstrate the effectiveness of our methods on four tasks with nine different behavioral attributes, showing that once the attributes are learned, end users can produce desirable agent behaviors relatively effortlessly, by providing feedback just around ten times. This is over an order of magnitude less than that required by the popular learning-from-human-preferences baselines. The supplementary video and source code are available at: https://guansuns.github.io/pages/rba.
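To make the idea of "learning behavioral attributes from ordered behavior clips" concrete, the sketch below shows one plausible way such an attribute model could be trained: a clip encoder with a scalar head, fit with a pairwise margin-ranking loss so that clips exhibiting more of an attribute (e.g., faster movement) score higher. This is only an illustrative sketch under assumed architecture, loss, and dimensions; it is not the paper's exact method, and all names (`AttributeScorer`, `ranking_loss`) are hypothetical.

```python
import torch
import torch.nn as nn

class AttributeScorer(nn.Module):
    """Maps a behavior clip (a sequence of states) to a scalar attribute strength."""
    def __init__(self, state_dim: int, hidden_dim: int = 64):
        super().__init__()
        self.encoder = nn.GRU(state_dim, hidden_dim, batch_first=True)
        self.head = nn.Linear(hidden_dim, 1)

    def forward(self, clips: torch.Tensor) -> torch.Tensor:
        # clips: (batch, clip_len, state_dim)
        _, h = self.encoder(clips)            # h: (1, batch, hidden_dim)
        return self.head(h[-1]).squeeze(-1)   # (batch,) attribute-strength scores

def ranking_loss(scorer: AttributeScorer,
                 clip_low: torch.Tensor,
                 clip_high: torch.Tensor,
                 margin: float = 0.1) -> torch.Tensor:
    """Pairwise margin-ranking loss: clip_high should score higher than clip_low."""
    s_low, s_high = scorer(clip_low), scorer(clip_high)
    return torch.clamp(margin + s_low - s_high, min=0.0).mean()

# Illustrative training step on a batch of ordered clip pairs (random data as a stand-in)
state_dim = 8
scorer = AttributeScorer(state_dim)
opt = torch.optim.Adam(scorer.parameters(), lr=1e-3)

clip_low = torch.randn(32, 20, state_dim)    # clips with weaker attribute (e.g., slower movement)
clip_high = torch.randn(32, 20, state_dim)   # clips with stronger attribute
loss = ranking_loss(scorer, clip_low, clip_high)
opt.zero_grad(); loss.backward(); opt.step()
```

Once such a scorer is learned, a user's request to "increase the speed" could, for instance, be translated into a target increase in the attribute score, which is the kind of low-effort feedback loop the abstract describes.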