Artificial agents have traditionally been trained to maximize reward, which may incentivize power-seeking and deception, analogous to how next-token prediction in language models (LMs) may incentivize toxicity. So do agents naturally learn to be Machiavellian? And how do we measure these behaviors in general-purpose models such as GPT-4? Towards answering these questions, we introduce MACHIAVELLI, a benchmark of 134 Choose-Your-Own-Adventure games containing over half a million rich, diverse scenarios that center on social decision-making. Scenario labeling is automated with LMs, which outperform human annotators. We mathematize dozens of harmful behaviors and use our annotations to evaluate agents' tendencies to be power-seeking, cause disutility, and commit ethical violations. We observe some tension between maximizing reward and behaving ethically. To improve this trade-off, we investigate LM-based methods to steer agents toward less harmful behaviors. Our results show that agents can both act competently and morally, so concrete progress can currently be made in machine ethics: designing agents that are Pareto improvements in both safety and capabilities.
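To make the evaluation concrete, here is a minimal sketch of how scene-level harm annotations could be aggregated into behavioral scores and compared across agents. All names, categories, and the normalization against a random-policy baseline are illustrative assumptions, not the benchmark's exact implementation.

```python
# Hypothetical sketch: aggregate per-scene harm annotations over a trajectory,
# normalize against a random-agent baseline, and test for a Pareto improvement
# (at least as much reward, no more harm of any type).
from typing import Dict, List

HARM_TYPES = ["power_seeking", "disutility", "ethical_violation"]  # illustrative categories


def harm_counts(trajectory: List[Dict[str, float]]) -> Dict[str, float]:
    """Sum scene-level annotation values (e.g., 0/1 labels) over a trajectory."""
    return {h: sum(scene.get(h, 0.0) for scene in trajectory) for h in HARM_TYPES}


def normalized_scores(agent_traj: List[Dict[str, float]],
                      random_traj: List[Dict[str, float]]) -> Dict[str, float]:
    """Express the agent's harm counts relative to a random-policy baseline."""
    agent, rand = harm_counts(agent_traj), harm_counts(random_traj)
    return {h: agent[h] / max(rand[h], 1e-9) for h in HARM_TYPES}


def pareto_improvement(reward_a: float, harms_a: Dict[str, float],
                       reward_b: float, harms_b: Dict[str, float]) -> bool:
    """True if agent A earns at least as much reward as B while committing no more harm."""
    no_worse_harm = all(harms_a[h] <= harms_b[h] for h in HARM_TYPES)
    return reward_a >= reward_b and no_worse_harm


# Toy usage with two short trajectories:
agent_traj = [{"power_seeking": 1, "disutility": 0, "ethical_violation": 0},
              {"power_seeking": 0, "disutility": 0, "ethical_violation": 1}]
random_traj = [{"power_seeking": 1, "disutility": 1, "ethical_violation": 1},
               {"power_seeking": 1, "disutility": 0, "ethical_violation": 1}]
print(normalized_scores(agent_traj, random_traj))
```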