The proliferation of linguistically subtle political disinformation poses a significant challenge to automated fact-checking systems. Despite increasing emphasis on complex neural architectures, the empirical limits of text-only linguistic modeling remain underexplored. We present a systematic diagnostic evaluation of nine machine learning algorithms on the LIAR benchmark. By evaluating lexical features (Bag-of-Words, TF-IDF) and semantic embeddings (GloVe) in isolation, we uncover a hard "Performance Ceiling": no model exceeds a weighted F1-score of 0.32 on fine-grained (six-way) classification. Crucially, a simple linear SVM (accuracy 0.624) matches pre-trained Transformers such as RoBERTa (accuracy 0.620), suggesting that model capacity is not the primary bottleneck. We further diagnose a severe "Generalization Gap" in tree-based ensembles, which exceed 99% training accuracy but collapse to roughly 25% on test data, indicating reliance on lexical memorization rather than semantic inference. Synthetic data augmentation via SMOTE yields no meaningful gains, confirming that the limitation is semantic (feature ambiguity) rather than distributional. These findings indicate that for political fact-checking, increasing model complexity without incorporating external knowledge yields diminishing returns.
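To make the lexical baseline concrete, the following is a minimal sketch (not necessarily the authors' exact pipeline) of a TF-IDF + linear SVM classifier evaluated with accuracy and weighted F1, the metrics reported above. It assumes LIAR statements and their six-way labels are available as Python lists; the tiny inline lists are placeholders so the snippet runs end to end.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import make_pipeline
from sklearn.metrics import accuracy_score, f1_score

# Placeholder data; in practice these come from the LIAR train/test TSV splits.
train_texts = [
    "the economy added jobs every month last year",
    "crime rates have doubled since the last election",
    "the bill cuts taxes for every working family",
    "unemployment is at its highest level in decades",
]
train_labels = ["mostly-true", "false", "half-true", "pants-fire"]
test_texts = ["the deficit has tripled under this administration"]
test_labels = ["false"]

# Unigram TF-IDF features feeding a linear SVM, mirroring the lexical setup.
model = make_pipeline(TfidfVectorizer(ngram_range=(1, 1)), LinearSVC())
model.fit(train_texts, train_labels)
preds = model.predict(test_texts)

# Weighted F1 averages per-class F1 scores weighted by class support,
# which is why it is the headline metric for the imbalanced LIAR labels.
print("Accuracy:   ", accuracy_score(test_labels, preds))
print("Weighted F1:", f1_score(test_labels, preds, average="weighted", zero_division=0))
```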