We describe Mega-COV, a billion-scale dataset from Twitter for studying COVID-19. The dataset is diverse (covers 268 countries), longitudinal (goes back as far as 2007), multilingual (comes in 100+ languages), and has a significant number of location-tagged tweets (~169M tweets). We release tweet IDs from the dataset. We also develop and release two powerful models, one for identifying whether or not a tweet is related to the pandemic (best F1=97%) and another for detecting misinformation about COVID-19 (best F1=92%). A human annotation study reveals the utility of our models on a subset of Mega-COV. Our data and models can be useful for studying a wide range of phenomena related to the pandemic. Mega-COV and our models are publicly available.