Automated bias detection in news text is widely used to support journalistic analysis and media accountability, yet little is known about how bias detection models arrive at their decisions or why they fail. In this work, we present a SHAP-based comparative interpretability study of two transformer-based bias detection models, both fine-tuned on the BABE dataset: a bias detector and a domain-adapted pre-trained RoBERTa model. We analyze word-level attributions across correct and incorrect predictions to characterize how different model architectures operationalize linguistic bias. Our results show that although both models attend to similar categories of evaluative language, they differ substantially in how these signals are integrated into predictions. The bias detector model assigns stronger internal evidence to false positives than to true positives, indicating a misalignment between attribution strength and prediction correctness that contributes to systematic over-flagging of neutral journalistic content. In contrast, the domain-adapted model exhibits attribution patterns that align more closely with prediction outcomes and produces 63\% fewer false positives. We further demonstrate that the two models' errors arise from distinct linguistic mechanisms, with false positives driven by discourse-level ambiguity rather than explicit bias cues. These findings highlight the importance of interpretability-aware evaluation for bias detection systems and suggest that architectural and training choices critically affect both model reliability and deployment suitability in journalistic contexts.
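The sketch below illustrates, under assumed names, how word-level SHAP attributions of the kind described above can be obtained for a fine-tuned transformer bias classifier. The checkpoint identifier and the "BIASED" label are placeholders, not the paper's actual model or label set; this is a minimal illustration of the general SHAP-on-text workflow, not the authors' pipeline.

```python
# Minimal sketch: word-level SHAP attributions for a fine-tuned
# transformer bias classifier (checkpoint name and label are assumptions).
import shap
from transformers import pipeline

# Hypothetical BABE-fine-tuned checkpoint; substitute the actual model.
clf = pipeline(
    "text-classification",
    model="your-org/roberta-babe-bias-detector",  # placeholder name
    top_k=None,  # return scores for every class, not just the top one
)

# SHAP's text explainer wraps the pipeline and attributes each class
# probability to the individual tokens of the input sentence.
explainer = shap.Explainer(clf)
sentences = ["The senator's reckless plan will devastate working families."]
shap_values = explainer(sentences)

# Inspect attributions toward the (assumed) 'BIASED' output for the sentence.
print(shap_values[:, :, "BIASED"])
```

In a study like the one summarized above, such per-token attributions would then be aggregated separately over true positives, false positives, and other outcome groups to compare attribution strength against prediction correctness.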