Social media, as a means of computer-mediated communication, has been used extensively to study the sentiment that users express around events or topics. There is, however, a gap in the longitudinal study of how sentiment in social media has evolved over the years. To fill this gap, we develop TM-Senti, a new large-scale, distantly supervised Twitter sentiment dataset with over 184 million tweets covering a period of more than seven years. We describe and assess our methodology for assembling a large-scale sentiment analysis dataset labelled using emoticons and emojis, and we analyse the resulting dataset. Our analysis highlights interesting temporal changes, including the increasing use of emojis over emoticons. We publicly release the dataset for further research in tasks including sentiment analysis and text classification of tweets. Because the dataset is built on the archive of tweets publicly available on the Internet Archive, it can be fully rehydrated, including tweet metadata, with no missing tweets.
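To illustrate what distant supervision with emoticons and emojis generally looks like, the following is a minimal sketch and not the pipeline used to build TM-Senti. The marker sets `POSITIVE` and `NEGATIVE`, the function name `distant_label`, and the choices to discard tweets with mixed or no markers and to strip the markers from the text are all assumptions made for illustration.

```python
from typing import Optional, Tuple

# Hypothetical, deliberately tiny marker sets (assumption): the actual
# emoticon and emoji lists used for the dataset are not given in this text.
POSITIVE = {":)", ":-)", ":D", "😀", "😊"}
NEGATIVE = {":(", ":-(", "😢", "😠"}
MARKERS = POSITIVE | NEGATIVE


def distant_label(text: str) -> Optional[Tuple[str, str]]:
    """Return (cleaned_text, label), or None if the tweet cannot be labelled.

    Tokens are split on whitespace only, so a marker glued to a word
    (e.g. "great:)") is not detected in this simplified sketch.
    """
    tokens = text.split()
    has_pos = any(t in POSITIVE for t in tokens)
    has_neg = any(t in NEGATIVE for t in tokens)

    # Skip tweets with no marker or with conflicting markers (assumption).
    if has_pos == has_neg:
        return None

    label = "positive" if has_pos else "negative"
    # Remove the markers so a classifier cannot simply memorise them.
    cleaned = " ".join(t for t in tokens if t not in MARKERS)
    return cleaned, label
```

For example, `distant_label("great day :)")` yields the text `"great day"` with the label `"positive"`, whereas a tweet containing both `:)` and `:(` is dropped.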