This paper presents and characterizes MIND, a new Portuguese corpus composed of different types of articles collected from online mainstream and alternative media sources over a 10-month period. The articles in the corpus are organized into five collections: facts, opinions, entertainment, satires, and conspiracy theories. We explain how the data collection process was conducted and present a set of linguistic metrics that allow a preliminary characterization of the texts included in the corpus. We also analyze the most frequent topics in the corpus and discuss the main differences and similarities among the five collections. Finally, we enumerate some tasks and applications that could benefit from this corpus, in particular those directly or indirectly related to misinformation detection. Overall, the corpus and its initial analysis are designed to support future exploratory news studies and to provide better insight into misinformation.