It has been argued that fake news and the spread of false information pose a threat to societies throughout the world, from influencing the results of elections to hindering efforts to manage the COVID-19 pandemic. To combat this threat, a number of Natural Language Processing (NLP) approaches have been developed. These leverage a range of datasets, feature extraction/selection techniques and machine learning (ML) algorithms to detect fake news before it spreads. While these methods are well-documented, there is less evidence regarding their efficacy in this domain. By systematically reviewing the literature, this paper aims to delineate the most performant approaches for fake news detection, identify the limitations of existing approaches, and suggest how these can be mitigated. The analysis of the results indicates that ensemble methods using a combination of news content and socially-based features are currently the most effective. Finally, it is proposed that future research should focus on developing approaches that address generalisability (which, in part, arises from limitations with current datasets), explainability and bias.
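To make the headline finding concrete, the following is a minimal, hypothetical sketch of an ensemble fake-news classifier over news-content features, assuming scikit-learn is available. The toy headlines, labels and model choices (TF-IDF features with a soft-voting ensemble of logistic regression and naive Bayes) are illustrative assumptions, not the specific systems reviewed in this paper, and socially-based features (e.g. user or propagation signals) are omitted for brevity.

```python
# Illustrative sketch only: a soft-voting ensemble over content-based
# (TF-IDF) features, assuming scikit-learn. Not the reviewed systems.
from sklearn.ensemble import VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labelled headlines: 1 = fake, 0 = real (fabricated examples).
texts = [
    "Miracle cure discovered, doctors hate it",
    "Government confirms new budget for schools",
    "Celebrity secretly an alien, insiders claim",
    "Central bank raises interest rates by 0.25%",
]
labels = [1, 0, 1, 0]

# TF-IDF content features feed an ensemble of two base classifiers;
# soft voting averages their predicted class probabilities.
model = make_pipeline(
    TfidfVectorizer(),
    VotingClassifier(
        estimators=[("lr", LogisticRegression()), ("nb", MultinomialNB())],
        voting="soft",
    ),
)
model.fit(texts, labels)
preds = model.predict(texts)
print(list(preds))
```

In practice, the reviewed approaches would train on far larger labelled corpora and augment the content features with socially-based ones before ensembling.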