The proliferation of fake news and its propagation on social media have become a major concern due to their potential for devastating impact. Various machine learning approaches have been proposed to detect fake news. However, most of them focus on a specific type of news (such as political news), which raises the question of dataset bias in the resulting models. In this research, we conduct a benchmark study to assess the performance of different applicable machine learning approaches on three datasets, among which we compiled the largest and most diversified one. We explore a number of advanced pre-trained language models for fake news detection alongside traditional and deep learning ones, and compare their performance from different aspects, to the best of our knowledge for the first time. We find that BERT and similar pre-trained models perform best for fake news detection, especially with very small datasets. Hence, these models are a significantly better option for languages with limited electronic content, i.e., limited training data. We also carry out several analyses based on the models' performance, the articles' topics, and the articles' lengths, and discuss the different lessons learned from them. We believe that this benchmark study will help the research community explore further and help news sites and blogs select the most appropriate fake news detection method.