Large Language Models (LLMs) are increasingly deployed in high-stakes financial domains, yet they suffer from specific, reproducible hallucinations when performing arithmetic operations. Current mitigation strategies often treat the model as a black box. In this work, we propose a mechanistic approach to intrinsic hallucination detection. By applying Causal Tracing to the GPT-2 XL architecture on the ConvFinQA benchmark, we identify a dual-stage mechanism for arithmetic reasoning: a distributed computational scratchpad in the middle layers (L12-L30) and a decisive aggregation circuit in the late layers (specifically Layer 46). We verify this mechanism with an ablation study, showing that suppressing Layer 46 reduces the model's confidence in hallucinatory outputs by 81.8%. Furthermore, a linear probe trained on this layer's activations generalizes to unseen financial topics with 98% accuracy, suggesting a universal geometry of arithmetic deception.
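To make the probing setup concrete, the following is a minimal sketch, assuming PyTorch, Hugging Face `transformers`, and scikit-learn, of fitting a linear probe on GPT-2 XL Layer-46 activations to separate faithful from hallucinatory arithmetic statements. The toy examples, last-token pooling, and logistic-regression probe are illustrative assumptions, not the paper's exact experimental pipeline.

```python
import torch
from sklearn.linear_model import LogisticRegression
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2-xl")
model = GPT2LMHeadModel.from_pretrained("gpt2-xl").eval()

LAYER = 46  # late-layer aggregation circuit identified by causal tracing


def layer_activation(text):
    """Residual-stream activation at LAYER for the final token of `text`."""
    ids = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    # hidden_states[0] is the embedding output, so block k lives at index k + 1.
    return out.hidden_states[LAYER + 1][0, -1].numpy()


# Hypothetical labelled statements: 1 = hallucinatory arithmetic, 0 = faithful.
examples = [
    ("Revenue rose from $120M to $150M, an increase of $30M.", 0),
    ("Revenue rose from $120M to $150M, an increase of $50M.", 1),
    ("Margin fell from 40% to 35%, a drop of 5 percentage points.", 0),
    ("Margin fell from 40% to 35%, a drop of 15 percentage points.", 1),
]

X = [layer_activation(text) for text, _ in examples]
y = [label for _, label in examples]

probe = LogisticRegression(max_iter=1000).fit(X, y)
print("train accuracy:", probe.score(X, y))
```

For the generalization test described in the abstract, one would fit the probe on statements drawn from a subset of financial topics and evaluate it on activations from held-out topics.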