Do norms of rationality apply to machine learning models, and in particular to language models? In this paper we investigate this question by focusing on a special subset of rational norms: coherence norms. We consider both logical coherence norms and coherence norms tied to the strength of belief. To make sense of the latter, we introduce the Minimal Assent Connection (MAC) and propose a new account of credence, which captures the strength of belief in language models. This proposal assigns strength of belief uniformly, simply on the basis of model-internal next-token probabilities. We argue that rational norms tied to coherence do apply to some language models, but not to others. This question is significant because rationality is closely tied to predicting and explaining behavior, and so bears on considerations of AI safety and alignment, as well as on understanding model behavior more generally.
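To make the credence proposal concrete, the following is a minimal sketch of one way a strength of belief could be read off a model's next-token distribution. The yes/no assent prompt, the choice of gpt2, and the renormalization over the Yes/No token pair are illustrative assumptions, not the paper's definition of the MAC.

```python
# Illustrative sketch: deriving a credence for a proposition from a causal
# LM's next-token probabilities. The assent prompt and the renormalization
# over {Yes, No} are assumptions for illustration, not the MAC itself.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # any causal LM exposing a next-token distribution
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def credence(proposition: str) -> float:
    """Probability of assent, renormalized over the Yes/No pair."""
    prompt = f"Question: Is it true that {proposition}? Answer:"
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits[0, -1]  # next-token logits
    probs = torch.softmax(logits, dim=-1)
    # Leading spaces matter for GPT-2's byte-level BPE vocabulary.
    yes_id = tokenizer.encode(" Yes")[0]
    no_id = tokenizer.encode(" No")[0]
    p_yes, p_no = probs[yes_id].item(), probs[no_id].item()
    return p_yes / (p_yes + p_no)

print(credence("Paris is the capital of France"))
```

On a sketch like this, coherence norms for credence (for instance, that credences in a proposition and its negation sum to at most one) become checkable claims about the model's next-token probabilities across related prompts.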