COVID-19 Twitter Dataset with Latent Topics, Sentiments and Emotions Attributes

From arXiv: the latest dataset version (V12, June 2022) has the following main updates: a) full data coverage extended to 28 January 2020 - 1 June 2022 (2 years and 4 months); b) country-specific CSV file downloads covering 30 representative countries; c) new vaccine-related data covering 3 November 2021 to 1 June 2022 (8 months); d) an updated discussion of the dataset's usage.

This paper describes a large global dataset on people's discourse and responses to the COVID-19 pandemic over the Twitter platform. From 28 January 2020 to 1 June 2022, we collected and processed over 252 million Twitter posts from more than 29 million unique users using four keywords: "corona", "wuhan", "nCov" and "covid". Leveraging probabilistic topic modelling and pre-trained machine learning-based emotion recognition algorithms, we labelled each tweet with seventeen attributes, including a) ten binary attributes indicating the tweet's relevance (1) or irrelevance (0) to the top ten detected topics, b) five quantitative emotion attributes indicating the degree of intensity of the valence or sentiment (from 0: extremely negative to 1: extremely positive) and the degree of intensity of fear, anger, sadness and happiness emotions (from 0: not at all to 1: extremely intense), and c) two categorical attributes indicating the sentiment (very negative, negative, neutral or mixed, positive, very positive) and the dominant emotion (fear, anger, sadness, happiness, no specific emotion) the tweet is mainly expressing. We discuss the technical validity and report the descriptive statistics of these attributes, their temporal distribution, and geographic representation. The paper concludes with a discussion of the dataset's usage in communication, psychology, public health, economics, and epidemiology.
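
The seventeen attributes described above can be pictured as a single record per tweet. The following is a minimal sketch, not the dataset's actual schema: the field names, the `TweetAttributes` class, and the case-insensitive substring matching in `matches_keywords` are all assumptions made for illustration. Only the four keywords, the attribute counts, the value ranges, and the category labels come from the abstract.

```python
from dataclasses import dataclass
from typing import Tuple

# The four collection keywords named in the abstract (lower-cased here).
KEYWORDS = ("corona", "wuhan", "ncov", "covid")

# Category labels as listed in the abstract.
SENTIMENT_LABELS = (
    "very negative", "negative", "neutral or mixed", "positive", "very positive",
)
EMOTION_LABELS = (
    "fear", "anger", "sadness", "happiness", "no specific emotion",
)

# Hypothetical names for the five intensity fields, each expected in [0, 1].
INTENSITY_FIELDS = ("valence", "fear", "anger", "sadness", "happiness")


def matches_keywords(text: str) -> bool:
    """Return True if the text contains any keyword.

    Case-insensitive substring matching is an assumption; the abstract
    does not say how the keywords were matched.
    """
    lowered = text.lower()
    return any(keyword in lowered for keyword in KEYWORDS)


@dataclass(frozen=True)
class TweetAttributes:
    """Hypothetical container for the 17 per-tweet attributes."""

    topics: Tuple[int, ...]  # ten binary flags: 1 = relevant, 0 = irrelevant
    valence: float           # 0 = extremely negative, 1 = extremely positive
    fear: float              # 0 = not at all, 1 = extremely intense
    anger: float
    sadness: float
    happiness: float
    sentiment: str           # one of SENTIMENT_LABELS
    emotion: str             # one of EMOTION_LABELS (dominant emotion)

    def __post_init__(self) -> None:
        # Validate the ten binary topic flags.
        if len(self.topics) != 10 or any(t not in (0, 1) for t in self.topics):
            raise ValueError("topics must be ten flags, each 0 or 1")
        # Validate the five intensity scores.
        for name in INTENSITY_FIELDS:
            value = getattr(self, name)
            if not 0.0 <= value <= 1.0:
                raise ValueError(f"{name} must lie in [0, 1], got {value}")
        # Validate the two categorical labels.
        if self.sentiment not in SENTIMENT_LABELS:
            raise ValueError(f"unknown sentiment label: {self.sentiment}")
        if self.emotion not in EMOTION_LABELS:
            raise ValueError(f"unknown emotion label: {self.emotion}")

    def attribute_count(self) -> int:
        """Ten topic flags + five intensities + two categorical labels."""
        return len(self.topics) + len(INTENSITY_FIELDS) + 2


# Made-up example values, purely for illustration.
example = TweetAttributes(
    topics=(1, 0, 0, 0, 1, 0, 0, 0, 0, 0),
    valence=0.31, fear=0.62, anger=0.40, sadness=0.45, happiness=0.12,
    sentiment="negative", emotion="fear",
)
```

Validating at construction time keeps malformed rows (for example, an intensity outside the 0 to 1 range) from silently entering later analysis. The actual column names and file layout should be taken from the dataset's own documentation.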
