We show how to assess a language model's knowledge of basic concepts of morality. We introduce the ETHICS dataset, a new benchmark that spans concepts in justice, well-being, duties, virtues, and commonsense morality. Models predict widespread moral judgments about diverse text scenarios. This requires connecting physical and social world knowledge to value judgments, a capability that may enable us to steer chatbot outputs or eventually regularize open-ended reinforcement learning agents. With the ETHICS dataset, we find that current language models have a promising but incomplete ability to predict basic human ethical judgments. Our work shows that progress can be made on machine ethics today, and it provides a stepping stone toward AI that is aligned with human values.