Recent advances in text classification and knowledge capture in language models have relied on the availability of large-scale text datasets. However, language models are trained on static snapshots of knowledge and are limited when that knowledge evolves. This is especially critical for misinformation detection, where new types of misinformation continuously appear and replace old campaigns. We propose time-aware misinformation datasets to capture such time-critical phenomena. In this paper, we first present evidence of evolving misinformation and show that incorporating even simple time-awareness significantly improves classifier accuracy. Second, we present COVID-TAD, a large-scale COVID-19 misinformation dataset spanning 25 months. It is the first large-scale misinformation dataset that contains multiple snapshots of a data stream, and it is orders of magnitude larger than related misinformation datasets. We describe the collection and labeling process, as well as preliminary experiments.