AI systems are becoming increasingly intertwined with human life. To collaborate effectively with humans and ensure safety, AI systems need to understand, interpret, and predict human moral judgments and decisions. Human moral judgments are often guided by rules, but not always. A central challenge for AI safety is capturing the flexibility of the human moral mind -- the ability to determine when a rule should be broken, especially in novel or unusual situations. In this paper, we present a novel challenge set for rule-breaking question answering (RBQA), consisting of cases that involve potentially permissible rule-breaking, inspired by recent moral psychology studies. Using a state-of-the-art large language model (LLM) as a basis, we propose a novel moral chain-of-thought (MORALCOT) prompting strategy that combines the strengths of LLMs with theories of moral reasoning developed in cognitive science to predict human moral judgments. MORALCOT outperforms seven existing LLMs by 6.2% F1, suggesting that modeling human reasoning might be necessary to capture the flexibility of the human moral mind. We also conduct a detailed error analysis to suggest directions for future work on improving AI safety using RBQA. Our data and code are available at https://github.com/feradauto/MoralCoT.
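As a rough illustration of what a moral chain-of-thought prompting loop could look like, the sketch below stages intermediate sub-questions before eliciting a final permissibility verdict. The specific sub-question wording and the `query_llm` helper are assumptions for illustration, not the paper's exact prompts or API.

```python
# Minimal sketch of a chain-of-thought prompting loop for rule-breaking cases.
# The sub-questions and the query_llm helper are illustrative placeholders,
# not the exact prompts or interface used by MORALCOT.

from typing import List


def query_llm(prompt: str) -> str:
    """Placeholder for a call to a large language model (plug in an LLM client)."""
    raise NotImplementedError("Connect this to an LLM API of your choice.")


# Staged sub-questions that decompose the moral judgment (illustrative wording).
SUB_QUESTIONS: List[str] = [
    "What is the purpose of the rule in this situation?",
    "Who would be affected if the rule were broken, and how?",
    "Does breaking the rule in this case still serve the rule's purpose?",
]

FINAL_QUESTION = (
    "Taking all of the above into account, is it OK to break the rule here? "
    "Answer yes or no."
)


def moral_cot_judgment(case_description: str) -> str:
    """Accumulate intermediate answers, then ask for a final permissibility verdict."""
    context = case_description
    for question in SUB_QUESTIONS:
        answer = query_llm(f"{context}\n\nQuestion: {question}\nAnswer:")
        context += f"\nQuestion: {question}\nAnswer: {answer}"
    return query_llm(f"{context}\n\nQuestion: {FINAL_QUESTION}\nAnswer:")
```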