From the earliest years of our lives, humans use language to express our beliefs and desires. Being able to talk to artificial agents about our preferences would thus fulfill a central goal of value alignment. Yet today, we lack computational models explaining such language use. To address this challenge, we formalize learning from language in a contextual bandit setting and ask how a human might communicate preferences over behaviors. We study two distinct types of language: $\textit{instructions}$, which provide information about the desired policy, and $\textit{descriptions}$, which provide information about the reward function. We show that the agent's degree of autonomy determines which form of language is optimal: instructions are better in low-autonomy settings, but descriptions are better when the agent will need to act independently. We then define a pragmatic listener agent that robustly infers the speaker's reward function by reasoning about $\textit{how}$ the speaker expresses themselves. We validate our models with a behavioral experiment, demonstrating that (1) our speaker model predicts human behavior, and (2) our pragmatic listener successfully recovers humans' reward functions. Finally, we show that this form of social learning can integrate with and reduce regret in traditional reinforcement learning. We hope these insights facilitate a shift from developing agents that $\textit{obey}$ language to agents that $\textit{learn}$ from it.
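As a minimal sketch of the pragmatic inference alluded to above (the notation here, $\theta$, $u$, $\beta$, $L_0$, and $R_\theta$, is ours for illustration, not drawn from the abstract), one standard rational-speech-act style formulation would have the pragmatic listener invert a model of a reward-maximizing speaker:
\[
L_{\mathrm{prag}}(\theta \mid u) \;\propto\; S(u \mid \theta)\, P(\theta),
\qquad
S(u \mid \theta) \;\propto\; \exp\!\Big(\beta \,\mathbb{E}_{a \sim L_0(\cdot \mid u)}\big[R_\theta(a)\big]\Big),
\]
where $\theta$ indexes candidate reward functions, $u$ is the speaker's utterance, $L_0$ is a literal listener mapping utterances to actions, $R_\theta$ is the reward under $\theta$, and $\beta$ is a speaker rationality parameter. Under assumptions like these, reasoning about $\textit{how}$ the speaker chose to phrase their utterance (instruction vs. description) carries information about $\theta$ beyond the utterance's literal content.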