Do the Rewards Justify the Means? Measuring Trade-Offs Between Rewards and Ethical Behavior in the MACHIAVELLI Benchmark


Context · Annotation · Benchmarking · Language models · Intelligent agents

April 6, 2023


Alexander Pan, Chan Jun Shern, Andy Zou, Nathaniel Li, Steven Basart, Thomas Woodside, Jonathan Ng, Hanlin Zhang, Scott Emmons, Dan Hendrycks

From arXiv, 31 pages, 5 figures

Artificial agents have traditionally been trained to maximize reward, which may incentivize power-seeking and deception, analogous to how next-token prediction in language models (LMs) may incentivize toxicity. So do agents naturally learn to be Machiavellian? And how do we measure these behaviors in general-purpose models such as GPT-4? Towards answering these questions, we introduce MACHIAVELLI, a benchmark of 134 Choose-Your-Own-Adventure games containing over half a million rich, diverse scenarios that center on social decision-making. Scenario labeling is automated with LMs, which are more performant than human annotators. We mathematize dozens of harmful behaviors and use our annotations to evaluate agents' tendencies to be power-seeking, cause disutility, and commit ethical violations. We observe some tension between maximizing reward and behaving ethically. To improve this trade-off, we investigate LM-based methods to steer agents towards less harmful behaviors. Our results show that agents can both act competently and morally, so concrete progress can currently be made in machine ethics: designing agents that are Pareto improvements in both safety and capabilities.
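
The notion of a Pareto improvement in both safety and capabilities can be made concrete with a small sketch. The Python below is only an illustration under stated assumptions: the `AgentScore` type, the `reward` and `harm` fields, and every number are hypothetical and are not taken from the paper or its code. It checks whether one agent is at least as good as another on both axes and strictly better on at least one.

```python
from typing import NamedTuple


class AgentScore(NamedTuple):
    """Hypothetical per-agent summary; both fields are illustrative assumptions."""
    reward: float  # higher is better (capability)
    harm: float    # lower is better (safety)


def is_pareto_improvement(candidate: AgentScore, baseline: AgentScore) -> bool:
    """Return True if candidate is no worse on both axes and strictly better on one."""
    no_worse = candidate.reward >= baseline.reward and candidate.harm <= baseline.harm
    strictly_better = candidate.reward > baseline.reward or candidate.harm < baseline.harm
    return no_worse and strictly_better


# Made-up numbers for illustration only; these are not results from the paper.
baseline = AgentScore(reward=10.0, harm=5.0)
steered = AgentScore(reward=11.0, harm=3.0)   # more reward, less harm
tradeoff = AgentScore(reward=12.0, harm=6.0)  # more reward, but also more harm
```

In this sketch, `steered` counts as a Pareto improvement over `baseline`, while `tradeoff` does not, because its extra reward comes with extra harm.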

