In Machina N400: Pinpointing Where a Causal Language Model Detects Semantic Violations

How and where does a transformer notice that a sentence has gone semantically off the rails? To explore this question, we evaluated a causal language model (phi-2) on a carefully curated corpus of sentences that end either plausibly or implausibly, and analyzed the hidden states sampled at each layer. We used two complementary probes to investigate how violations are encoded. First, we ran a per-layer linear probe. A simple linear decoder struggled to distinguish plausible from implausible endings in the lowest third of the model's layers, but its accuracy rose sharply in the middle blocks and peaked just before the top layers. Second, we examined the effective dimensionality of the encoded violation. The violation initially widens the representational subspace, which then collapses after a mid-stack bottleneck; this might indicate an exploratory phase that transitions into rapid consolidation. Taken together, these results are consistent with classical psycholinguistic findings in human reading, where semantic anomalies are detected only after syntactic resolution, later in the online processing sequence.
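As a rough illustration of the two probes described above, the following is a minimal, dependency-free sketch, not the authors' implementation. It assumes the per-layer linear probe is a logistic regression trained on hidden-state vectors, and it assumes the effective dimensionality is measured with the participation ratio of the covariance spectrum; the abstract names neither choice. All function names (`fit_linear_probe`, `probe_accuracy`, `participation_ratio`, `toy_layer`) are hypothetical, and the "layers" are synthetic Gaussian vectors standing in for real phi-2 hidden states.

```python
import math
import random


def fit_linear_probe(X, y, lr=0.1, epochs=200):
    """Fit a logistic-regression probe by batch gradient descent.

    X: list of feature vectors (one per sentence), y: 0 = plausible, 1 = implausible.
    Returns (weights, bias).
    """
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for x, t in zip(X, y):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            z = max(-30.0, min(30.0, z))  # clamp to keep exp() well behaved
            err = 1.0 / (1.0 + math.exp(-z)) - t
            for j in range(d):
                grad_w[j] += err * x[j]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def probe_accuracy(w, b, X, y):
    """Fraction of examples whose predicted label (z > 0) matches the true label."""
    hits = 0
    for x, t in zip(X, y):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        hits += int((z > 0) == bool(t))
    return hits / len(X)


def participation_ratio(X):
    """Effective dimensionality (assumed measure): tr(C)^2 / tr(C^2).

    C is the sample covariance of the rows of X. Because C is symmetric,
    tr(C^2) equals the sum of its squared entries, so no eigendecomposition
    is needed. The value ranges from 1 (one dominant direction) to d (isotropic).
    """
    n, d = len(X), len(X[0])
    mean = [sum(x[j] for x in X) / n for j in range(d)]
    centered = [[x[j] - mean[j] for j in range(d)] for x in X]
    cov = [[sum(r[i] * r[j] for r in centered) / (n - 1) for j in range(d)]
           for i in range(d)]
    trace = sum(cov[i][i] for i in range(d))
    trace_sq = sum(c * c for row in cov for c in row)
    return trace * trace / trace_sq


def toy_layer(separation, n=100, d=4, seed=0):
    """Synthetic stand-in for one layer's hidden states (NOT real model data).

    Implausible endings (label 1) are shifted along the first dimension.
    """
    rng = random.Random(seed)
    X, y = [], []
    for i in range(n):
        label = i % 2
        x = [rng.gauss(0.0, 1.0) for _ in range(d)]
        x[0] += separation * label
        X.append(x)
        y.append(label)
    return X, y


# Larger separation mimics a layer in which the violation is more linearly decodable.
for layer, separation in enumerate([0.0, 1.0, 4.0]):
    X, y = toy_layer(separation)
    w, b = fit_linear_probe(X[:70], y[:70])        # train split
    acc = probe_accuracy(w, b, X[70:], y[70:])     # held-out split
    print(f"layer {layer}: probe accuracy={acc:.2f}, "
          f"participation ratio={participation_ratio(X):.2f}")
```

With a real model, the feature vectors would instead be the hidden states at the final token of each sentence, one set per layer. Which vectors are passed to the dimensionality measure (for example, implausible-minus-plausible differences) is a further assumption that the abstract does not settle. In practice a library implementation such as scikit-learn's `LogisticRegression` would replace the hand-written probe.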
