Fake news detection has become a pressing task as the amount of fake news on the internet has grown steadily in recent years. Although many classification models based on statistical learning methods have been proposed and show good results, the reasoning behind their classification performance is often insufficiently explained. Studies on self-supervised learning have highlighted that the quality of the representation (embedding) space matters and directly affects downstream task performance. In this study, the quality of the representation space is analyzed both visually and analytically in terms of the linear separability of classes on a dataset of real and fake news. To further add interpretability to the classification model, a modification of Class Activation Mapping (CAM) is proposed. The modified CAM assigns a CAM score to each word token, where the score indicates how strongly the model focuses on that token when making its prediction. Finally, it is shown that a plain BERT model topped with a learnable linear layer is sufficient to achieve robust performance while remaining compatible with CAM.
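To make the token-level CAM idea concrete, the sketch below shows one plausible adaptation of CAM to BERT, assuming the classification head is a single linear layer applied to mean-pooled token embeddings; under that assumption the class logit decomposes into per-token contributions, which can serve as CAM scores. The pooling choice, the two-class head, and the untrained weights are illustrative assumptions, not the paper's exact formulation.

```python
# Minimal sketch of token-level CAM on top of BERT (an illustration under
# the assumptions stated above, not the paper's exact method). With a
# linear head over mean-pooled token embeddings, the logit for class c is
# (1/n) * sum_i w_c . h_i, so w_c . h_i acts as the CAM score of token i,
# the analogue of CAM's weighted feature maps in CNNs.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

# Learnable linear head for real vs. fake. Left untrained here to keep the
# sketch self-contained; in practice it would be fitted on the news dataset.
classifier = torch.nn.Linear(bert.config.hidden_size, 2)

text = "Scientists confirm the moon is made of cheese."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    hidden = bert(**inputs).last_hidden_state   # (1, seq_len, 768)
    pooled = hidden.mean(dim=1)                 # mean-pool token embeddings
    logits = classifier(pooled)                 # (1, 2)
    pred = logits.argmax(dim=-1).item()

    # CAM score per token: projection of each token embedding onto the
    # weight vector of the predicted class (bias omitted, as in CAM).
    cam = hidden[0] @ classifier.weight[pred]   # (seq_len,)

tokens = tokenizer.convert_ids_to_tokens(inputs["input_ids"][0].tolist())
for tok, score in zip(tokens, cam.tolist()):
    print(f"{tok:>12s}  {score:+.3f}")
```

Because the per-token scores sum (up to the pooling factor and bias) to the class logit, they are directly attributable: a large positive score on a token means that token pushed the model toward the predicted class.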