As social media becomes increasingly prominent in our day-to-day lives, it is ever more important to detect informative content and prevent the spread of disinformation and unverified rumours. While many sophisticated and successful models have been proposed in the literature, they are often compared with older NLP baselines such as SVMs, CNNs, and LSTMs. In this paper, we examine the performance of a broad set of modern transformer-based language models and show that with basic fine-tuning, these models are competitive with, and can even significantly outperform, recently proposed state-of-the-art methods. We present our framework as a baseline for creating and evaluating new methods for misinformation detection. We further study a comprehensive set of benchmark datasets, and discuss potential data leakage and the need for careful experimental design and understanding of datasets to account for confounding variables. As an extreme example, we show that classifying tweets based only on the first three digits of their IDs, which encode information about the date, gives state-of-the-art performance on Twitter16, a commonly used benchmark dataset for fake news detection. We provide a simple tool to detect this problem and suggest steps to mitigate it in future datasets.
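To make the tweet-ID leakage concrete, the following minimal sketch (not part of the paper's released tool) illustrates why ID prefixes carry date information: Twitter's "Snowflake" IDs store a millisecond timestamp in their upper bits, so the leading digits of an ID correlate strongly with when the tweet was posted. The function name and the example ID below are illustrative; the epoch offset is the publicly documented Snowflake constant.

```python
# Sketch: recover the approximate creation time encoded in a Snowflake tweet ID.
# Because the timestamp occupies the most significant bits, the first few digits
# of the decimal ID track the posting date, which can leak class labels when
# different classes were collected in different time periods.
from datetime import datetime, timezone

TWITTER_EPOCH_MS = 1288834974657  # Snowflake epoch offset (2010-11-04, in ms)

def tweet_id_to_datetime(tweet_id: int) -> datetime:
    """Approximate creation time of a tweet from its Snowflake ID."""
    ms_since_unix_epoch = (tweet_id >> 22) + TWITTER_EPOCH_MS
    return datetime.fromtimestamp(ms_since_unix_epoch / 1000, tz=timezone.utc)

if __name__ == "__main__":
    example_id = 693473597912780800  # arbitrary illustrative ID
    print(tweet_id_to_datetime(example_id))
```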