Reward design in reinforcement learning (RL) is challenging since human notions of desired behavior may be difficult to specify via reward functions or may require many expert demonstrations. Can we instead cheaply design rewards using a natural language interface? This paper explores how to simplify reward design by prompting a large language model (LLM) such as GPT-3 as a proxy reward function, where the user provides a textual prompt containing a few examples (few-shot) or a description (zero-shot) of the desired behavior. Our approach leverages this proxy reward function in an RL framework. Specifically, users specify a prompt once at the beginning of training. During training, the LLM evaluates an RL agent's behavior against the desired behavior described by the prompt and outputs a corresponding reward signal. The RL agent then uses this reward to update its behavior. We evaluate whether our approach can train agents aligned with user objectives in the Ultimatum Game, matrix games, and the DealOrNoDeal negotiation task. In all three tasks, we show that RL agents trained with our framework are well-aligned with the user's objectives and outperform RL agents trained with reward functions learned via supervised learning.
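For concreteness, the sketch below illustrates how an LLM prompt can serve as a proxy reward inside a standard RL training loop, as described above. It is a minimal illustration under stated assumptions, not the authors' implementation: the names `query_llm`, `proxy_reward`, `agent.rollout`, `agent.update`, and `env.render_episode` are hypothetical placeholders.

```python
# Minimal sketch of prompting an LLM as a proxy reward function.
# All names here (query_llm, render_episode, rollout, update) are
# hypothetical placeholders, not the paper's actual implementation.

def proxy_reward(user_prompt: str, episode_text: str, query_llm) -> float:
    """Ask the LLM whether the episode matches the user-described behavior.

    The prompt contains the user's few-shot examples or zero-shot
    description, followed by the agent's behavior rendered as text; the
    LLM's Yes/No judgment is mapped to a binary reward.
    """
    question = (
        f"{user_prompt}\n\n"
        f"Agent behavior:\n{episode_text}\n\n"
        "Does this behavior match the desired behavior? Answer Yes or No."
    )
    answer = query_llm(question)  # e.g., a call to a GPT-3-style API
    return 1.0 if answer.strip().lower().startswith("yes") else 0.0


def train(agent, env, user_prompt, query_llm, num_episodes=1000):
    """RL loop where the LLM's judgment is the only reward the agent sees."""
    for _ in range(num_episodes):
        episode = agent.rollout(env)                 # collect one episode
        reward = proxy_reward(
            user_prompt,
            env.render_episode(episode),             # episode -> text
            query_llm,
        )
        agent.update(episode, reward)                # standard RL update
```

The key design point the abstract emphasizes is that the user writes the prompt only once, before training; every subsequent reward signal comes from the LLM's evaluation of the agent's behavior against that fixed prompt.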