MMCoVaR: Multimodal COVID-19 Vaccine Focused Data Repository for Fake News Detection and a Baseline Architecture for Classification

The outbreak of COVID-19 has resulted in an "infodemic" that has encouraged the propagation of misinformation about COVID-19 and its treatment, which in turn could negatively affect the adoption of recommended public health measures in the larger population. In this paper, we provide a new multimodal (images, text and temporal information) labeled dataset containing news articles and tweets on the COVID-19 vaccine. We collected 2,593 news articles from 80 publishers between February 16, 2020 and May 8, 2021, and 24,184 Twitter posts between April 17, 2021 and May 8, 2021. We combine ratings from three news media ranking sites, Media Bias Chart, NewsGuard and Media Bias/Fact Check (MBFC), to classify the news dataset into two levels of credibility: reliable and unreliable. Combining three filters allows for higher labeling precision. We also propose a stance detection mechanism to annotate tweets into three levels of credibility: reliable, unreliable and inconclusive. We provide several statistics and other analytics, such as publisher distribution, publication date distribution and topic analysis. We also provide a novel architecture that classifies the news data as misinformation or truth, giving a baseline performance for this dataset. The proposed architecture achieves an F-score of 0.919 and an accuracy of 0.882 for fake news detection. Furthermore, we provide benchmark performance for misinformation detection on the tweet dataset. This new multimodal dataset can be used in research on the COVID-19 vaccine, including misinformation detection and the influence of fake COVID-19 vaccine information.
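As an illustration of how ratings from three sources might be merged into a single credibility label, the following is a minimal sketch. The abstract does not state the actual combination rule, so the unanimity requirement used here is an assumption, as are the function name `combine_ratings`, the source keys, and the boolean encoding of each site's verdict.

```python
from typing import Dict, Optional

# Hypothetical per-source verdict for one publisher:
# True = rated reliable, False = rated unreliable, None = no rating available.
Ratings = Dict[str, Optional[bool]]

# Illustrative keys for the three rating sites named in the abstract.
SOURCES = ("media_bias_chart", "newsguard", "mbfc")


def combine_ratings(ratings: Ratings) -> Optional[str]:
    """Return a label only when every available source agrees.

    Assumption: unanimous agreement is required. Publishers with no
    ratings, or with conflicting ratings, are left unlabeled (None).
    """
    votes = [ratings.get(s) for s in SOURCES if ratings.get(s) is not None]
    if not votes:
        return None  # no source rated this publisher
    if all(votes):
        return "reliable"
    if not any(votes):
        return "unreliable"
    return None  # sources disagree, so the publisher is excluded


if __name__ == "__main__":
    example = {"media_bias_chart": True, "newsguard": True, "mbfc": True}
    print(combine_ratings(example))
```

Requiring agreement trades coverage for precision: fewer publishers receive a label, but each label is backed by every source that rated the publisher. A majority vote would be an equally plausible alternative under different assumptions.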
