In this paper, we present ArCOV-19, an Arabic COVID-19 Twitter dataset that spans one year, from 27 January 2020 to 31 January 2021. ArCOV-19 is the first publicly available Arabic Twitter dataset covering the COVID-19 pandemic that includes about 2.7M tweets alongside the propagation networks of the most popular subset of them (i.e., the most retweeted and most liked). The propagation networks include both retweets and conversational threads (i.e., threads of replies). ArCOV-19 is designed to enable research in several domains, including natural language processing, information retrieval, and social computing. Preliminary analysis shows that ArCOV-19 captures rising discussions associated with the first reported cases of the disease as they appeared in the Arab world. In addition to the source tweets and propagation networks, we also release the search queries and the language-independent crawler used to collect the tweets, to encourage the curation of similar datasets.