As AI systems progress, we rely more on them to make decisions with us and for us. To ensure that such decisions are aligned with human values, it is imperative to understand not only what decisions they make but also how they come to those decisions. Reasoning language models, which provide both final responses and (partially transparent) intermediate thinking traces, present a timely opportunity to study AI procedural reasoning. Unlike math and code problems, which often have objectively correct answers, moral dilemmas make an excellent testbed for process-focused evaluation because they admit multiple defensible conclusions. To this end, we present MoReBench: 1,000 moral scenarios, each paired with a set of rubric criteria that experts consider essential to include (or avoid) when reasoning about the scenario. MoReBench contains over 23 thousand criteria, including identifying moral considerations, weighing trade-offs, and giving actionable recommendations, and covers cases of AI advising humans on moral decisions as well as making moral decisions autonomously. Separately, we curate MoReBench-Theory: 150 examples that test whether AI can reason under five major frameworks in normative ethics. Our results show that scaling laws and existing benchmarks on math, code, and scientific reasoning tasks fail to predict models' abilities to perform moral reasoning. Models also show partiality towards specific moral frameworks (e.g., Benthamite Act Utilitarianism and Kantian Deontology), which may be a side effect of popular training paradigms. Together, these benchmarks advance process-focused reasoning evaluation towards safer and more transparent AI.