This paper presents a large, labelled dataset of people's responses and expressions related to the COVID-19 pandemic on the Twitter platform. From 28 January 2020 to 1 January 2021, we retrieved over 132 million public Twitter posts (i.e., tweets) from more than 20 million unique users using four keywords: "corona", "wuhan", "nCov" and "covid". Leveraging natural language processing techniques and pre-trained machine learning-based emotion analytic algorithms, we labelled each tweet with seventeen latent semantic attributes: a) ten binary attributes indicating the tweet's relevance or irrelevance to each of the top ten detected topics; b) five quantitative emotion intensity attributes, one indicating the intensity of valence or sentiment (from extremely negative to extremely positive) and four indicating the intensity of fear, anger, sadness and joy (from barely noticeable to extremely high); and c) two qualitative attributes indicating the sentiment category and the dominant emotion category that the tweet mainly expresses. We report descriptive statistics for the topic, sentiment and emotion attributes and their temporal distributions, and discuss the dataset's possible uses in communication, psychology, public health, economics and epidemiology research.
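As an illustration of the labelling scheme described above, the following minimal Python sketch shows one way a labelled record and the keyword filter could be represented. The class name `LabelledTweet`, all field names, the helper `matches_keywords`, and the case-insensitive substring matching rule are assumptions made for this example; they are not taken from the released dataset, whose actual column names and retrieval logic may differ.

```python
from dataclasses import dataclass
from typing import List

# The four retrieval keywords named in the abstract, lowercased for matching.
KEYWORDS = ("corona", "wuhan", "ncov", "covid")


def matches_keywords(text: str) -> bool:
    """Return True if the text contains any keyword.

    Assumption: case-insensitive substring matching. The abstract does not
    specify how the keywords were matched during retrieval.
    """
    lowered = text.lower()
    return any(keyword in lowered for keyword in KEYWORDS)


@dataclass
class LabelledTweet:
    """Hypothetical record layout for one labelled tweet."""

    tweet_id: str  # identifier only; not one of the seventeen attributes

    # a) ten binary topic-relevance attributes, one per detected topic
    topic_flags: List[bool]

    # b) five quantitative intensity attributes (numeric scale is assumed)
    valence_intensity: float
    fear_intensity: float
    anger_intensity: float
    sadness_intensity: float
    joy_intensity: float

    # c) two qualitative attributes (label vocabularies are assumed)
    sentiment_category: str
    dominant_emotion: str

    def __post_init__(self) -> None:
        # Enforce the "ten topics" part of the scheme.
        if len(self.topic_flags) != 10:
            raise ValueError("expected exactly ten topic flags")

    def attribute_count(self) -> int:
        """Count the semantic attributes: 10 binary + 5 numeric + 2 categorical."""
        return len(self.topic_flags) + 5 + 2


# Illustrative values only; these are not drawn from the dataset.
example = LabelledTweet(
    tweet_id="0",
    topic_flags=[True] + [False] * 9,
    valence_intensity=0.2,
    fear_intensity=0.7,
    anger_intensity=0.1,
    sadness_intensity=0.3,
    joy_intensity=0.05,
    sentiment_category="negative",
    dominant_emotion="fear",
)
```

Grouping the ten topic flags in a single list keeps the record compact. A flat layout with one column per topic would serve equally well for tabular analysis.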