MMCoVaR: Multimodal COVID-19 Vaccine Focused Data Repository for Fake News Detection and a Baseline Architecture for Classification

The outbreak of COVID-19 has resulted in an "infodemic" that has encouraged the propagation of misinformation about COVID-19 and cure methods, which in turn could negatively affect the adoption of recommended public health measures in the larger population. In this paper, we provide a new multimodal (images, text, and temporal information) labeled dataset containing news articles and tweets on the COVID-19 vaccine. We collected 2,593 news articles from 80 publishers for one year, between February 16, 2020 and May 8, 2021, and 24,184 Twitter posts (collected between April 17, 2021 and May 8, 2021). We combine ratings from two news media ranking sites, Media Bias Chart and Media Bias/Fact Check (MBFC), to classify the news dataset into two levels of credibility: reliable and unreliable. Combining the two filters allows for higher labeling precision. We also propose a stance detection mechanism to annotate tweets into three levels of credibility: reliable, unreliable, and inconclusive. We provide several statistics as well as other analytics, such as publisher distribution, publication date distribution, and topic analysis. We also provide a novel architecture that classifies the news data as misinformation or truth, to provide a baseline performance for this dataset. We find that the proposed architecture has an F-score of 0.919 and an accuracy of 0.882 for fake news detection. Furthermore, we provide benchmark performance for misinformation detection on the tweet dataset. This new multimodal dataset can be used in research on the COVID-19 vaccine, including misinformation detection and the influence of fake COVID-19 vaccine information.
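The abstract does not spell out how the two publisher ratings are combined. One plausible reading, sketched below purely as an assumption, is an agreement rule: an article inherits a label only when both rating sources assign its publisher the same credibility level, and is otherwise left unlabeled. Dropping the disagreements is what would trade coverage for labeling precision. The publisher names, rating tables, and function names are hypothetical and are not taken from the dataset.

```python
from typing import Dict, List, Optional

# Hypothetical publisher ratings, for illustration only (not real data).
MEDIA_BIAS_CHART: Dict[str, str] = {
    "publisher_a": "reliable",
    "publisher_b": "unreliable",
    "publisher_c": "reliable",
}
MBFC: Dict[str, str] = {
    "publisher_a": "reliable",
    "publisher_b": "unreliable",
    "publisher_c": "unreliable",
}


def combine_labels(publisher: str) -> Optional[str]:
    """Return a label only when both sources rate the publisher identically.

    Assumed rule: publishers that are missing from either source, or on
    which the sources disagree, get no label (None).
    """
    first = MEDIA_BIAS_CHART.get(publisher)
    second = MBFC.get(publisher)
    if first is not None and first == second:
        return first
    return None


def label_articles(articles: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Attach a 'label' to each article whose publisher has an agreed rating."""
    labeled = []
    for article in articles:
        label = combine_labels(article["publisher"])
        if label is not None:
            labeled.append({**article, "label": label})
    return labeled


articles = [
    {"title": "Article 1", "publisher": "publisher_a"},
    {"title": "Article 2", "publisher": "publisher_b"},
    {"title": "Article 3", "publisher": "publisher_c"},  # sources disagree
]
labeled_articles = label_articles(articles)
```

Under this assumed rule, the third article is excluded because its publisher is rated differently by the two sources.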
