The COVID-19 pandemic has fueled the spread of misinformation on social media and the Web as a whole. The phenomenon dubbed `infodemic' has taken the challenges of information veracity and trust to new heights by massively introducing seemingly scientific and technical elements into misleading content. Despite the existing body of work on modeling and predicting misinformation, coverage of highly complex scientific topics with inherent uncertainty and an evolving set of findings, such as COVID-19, poses many new challenges that existing tools do not easily solve. To address these issues, we introduce SciLander, a method for learning representations of news sources that report on science-based topics. SciLander extracts four heterogeneous indicators for news sources: two generic indicators that capture (1) the copying of news stories between sources and (2) the use of the same terms to mean different things (i.e., the semantic shift of terms), and two scientific indicators that capture (3) the usage of jargon and (4) the stance towards specific citations. We use these indicators as signals of source agreement, sampling positive (similar) and negative (dissimilar) pairs of sources, and combine them in a unified framework to train unsupervised news source embeddings with a triplet margin loss objective. We evaluate our method on a novel COVID-19 dataset containing nearly 1M news articles from 500 sources, spanning a period of 18 months since the beginning of the pandemic in 2020. Our results show that the features learned by our model outperform state-of-the-art baselines on the task of news veracity classification. Furthermore, a clustering analysis suggests that the learned representations encode information about the reliability, political leaning, and partisan bias of these sources.
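The triplet margin objective mentioned above can be illustrated with a minimal NumPy sketch. The function and the 2-D embedding values below are hypothetical simplifications for illustration only; the actual model trains source embeddings jointly over all four indicators.

```python
import numpy as np

def triplet_margin_loss(anchor, positive, negative, margin=1.0):
    # Hinge-style loss: penalize the triplet when the anchor-positive
    # distance is not smaller than the anchor-negative distance by at
    # least `margin`.
    d_pos = np.linalg.norm(anchor - positive)
    d_neg = np.linalg.norm(anchor - negative)
    return max(d_pos - d_neg + margin, 0.0)

# Toy "source embeddings" (hypothetical values).
anchor   = np.array([0.0, 0.0])
positive = np.array([1.0, 0.0])  # a source that agrees with the anchor
negative = np.array([3.0, 0.0])  # a dissimilar source

print(triplet_margin_loss(anchor, positive, negative))               # 0.0: margin satisfied
print(triplet_margin_loss(anchor, positive, np.array([1.5, 0.0])))   # 0.5: margin violated
```

Minimizing this loss over many sampled triplets pulls sources that agree (by any of the four indicators) closer in the embedding space while pushing disagreeing sources at least `margin` apart.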