This paper describes a large global dataset of public responses to the COVID-19 pandemic on Twitter. From 28 January 2020 to 1 September 2021, we collected over 198 million Twitter posts from more than 25 million unique users using four keywords: "corona", "wuhan", "nCov" and "covid". Leveraging topic modeling techniques and pre-trained machine-learning emotion analysis algorithms, we labeled each tweet with seventeen semantic attributes: a) ten binary attributes indicating whether the tweet is relevant to each of the top ten detected topics, b) five quantitative emotion attributes giving the intensity of valence, or sentiment (from 0: very negative to 1: very positive), and the intensity of fear, anger, happiness and sadness (from 0: not at all to 1: extremely intense), and c) two qualitative attributes giving the sentiment category (very negative, negative, neutral or mixed, positive, very positive) and the dominant emotion category (fear, anger, happiness, sadness, no specific emotion) that the tweet mainly expresses. We report descriptive statistics for these new attributes, their temporal distributions, and the overall geographic representation of the dataset. The paper concludes with an outline of the dataset's possible uses in communication, psychology, public health, economics, and epidemiology.
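To make the seventeen-attribute labeling scheme concrete, the following is a minimal Python sketch of what one labeled record could look like. The class name `TweetLabels`, the field names, and the validation rules are illustrative assumptions and are not taken from the dataset's actual file format; only the attribute groups, value ranges, and category names come from the description above.

```python
from dataclasses import dataclass
from typing import Tuple

# Category names as listed in the abstract.
SENTIMENT_CATEGORIES = (
    "very negative", "negative", "neutral or mixed", "positive", "very positive",
)
EMOTION_CATEGORIES = (
    "fear", "anger", "happiness", "sadness", "no specific emotion",
)

# Hypothetical field names for the five intensity scores.
INTENSITY_FIELDS = ("valence", "fear", "anger", "happiness", "sadness")


@dataclass(frozen=True)
class TweetLabels:
    """Hypothetical container for the seventeen attributes of one tweet."""

    # a) ten binary flags: relevance to each of the top ten topics
    topics: Tuple[bool, ...]
    # b) five intensity scores, each in [0, 1]
    valence: float     # 0: very negative, 1: very positive
    fear: float        # 0: not at all, 1: extremely intense
    anger: float
    happiness: float
    sadness: float
    # c) two categorical attributes
    sentiment: str
    dominant_emotion: str

    def __post_init__(self) -> None:
        # Reject records that do not match the described scheme.
        if len(self.topics) != 10:
            raise ValueError("expected exactly ten topic flags")
        for name in INTENSITY_FIELDS:
            score = getattr(self, name)
            if not 0.0 <= score <= 1.0:
                raise ValueError(f"{name} must lie in [0, 1], got {score}")
        if self.sentiment not in SENTIMENT_CATEGORIES:
            raise ValueError(f"unknown sentiment category: {self.sentiment}")
        if self.dominant_emotion not in EMOTION_CATEGORIES:
            raise ValueError(f"unknown emotion category: {self.dominant_emotion}")

    def attribute_count(self) -> int:
        # 10 topic flags + 5 intensity scores + 2 categories
        return len(self.topics) + len(INTENSITY_FIELDS) + 2


# Example record with made-up values, for illustration only.
example = TweetLabels(
    topics=(True,) + (False,) * 9,
    valence=0.2, fear=0.7, anger=0.1, happiness=0.0, sadness=0.3,
    sentiment="negative", dominant_emotion="fear",
)
```

Grouping the ten topic flags into one tuple keeps the record compact while still counting as ten separate attributes; a flat layout with one column per topic would be an equally valid choice.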