The widespread use of social networks during mass convergence events, such as health emergencies and disease outbreaks, provides instant access to citizen-generated data that carry rich information about public opinions, sentiments, urgent needs, and situational reports. Such information can help authorities understand an emerging situation and react accordingly. Moreover, social media plays a vital role in tackling misinformation and disinformation. This work presents TBCOV, a large-scale Twitter dataset comprising more than two billion multilingual tweets related to the COVID-19 pandemic, collected worldwide over a continuous period of more than one year. More importantly, several state-of-the-art deep learning models are used to enrich the data with important attributes, including sentiment labels, named entities (e.g., mentions of persons, organizations, locations), user types, and gender information. Finally, a geotagging method is proposed to assign country, state, county, and city information to tweets, enabling a wide range of data analysis tasks aimed at understanding real-world issues. Our sentiment and trend analyses reveal interesting insights and confirm TBCOV's broad coverage of important topics.