The emojification of sentiment on social media: Collection and analysis of a longitudinal Twitter sentiment dataset

Social media, as a means for computer-mediated communication, has been extensively used to study the sentiment expressed by users around events or topics. There is, however, a gap in the longitudinal study of how sentiment has evolved on social media over the years. To fill this gap, we develop TM-Senti, a new large-scale, distantly supervised Twitter sentiment dataset with over 184 million tweets, covering a time period of over seven years. We describe and assess our methodology for assembling a large-scale sentiment analysis dataset labelled using emoticons and emojis, and we present an analysis of the resulting dataset. Our analysis highlights interesting temporal changes, among them the increasing use of emojis over emoticons. We publicly release the dataset for further research in tasks including sentiment analysis and text classification of tweets. The dataset can be fully rehydrated, including tweet metadata and without missing tweets, thanks to the archive of tweets publicly available on the Internet Archive, on which the dataset is based.
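To make the idea of emoticon- and emoji-based distant supervision concrete, the following is a minimal sketch of how a tweet might be labelled from the markers it contains. It is not the authors' pipeline: the marker sets, the function name `distant_label`, and the rule of discarding tweets with mixed or no markers are all assumptions made for illustration.

```python
from typing import Optional, Tuple

# Hypothetical marker sets, chosen for illustration only.
# The lists actually used to build TM-Senti are not given in the abstract.
POSITIVE_MARKERS = {":)", ":-)", ":D", "😀", "😊"}
NEGATIVE_MARKERS = {":(", ":-(", "😢", "😠"}


def distant_label(text: str) -> Optional[Tuple[str, str]]:
    """Return (label, cleaned_text), or None if the tweet cannot be labelled.

    Assumed rule: a tweet is labelled only when it contains markers of
    exactly one polarity; tweets with no markers or with both polarities
    are discarded.
    """
    has_pos = any(marker in text for marker in POSITIVE_MARKERS)
    has_neg = any(marker in text for marker in NEGATIVE_MARKERS)
    if has_pos == has_neg:
        # Either no marker at all, or conflicting markers.
        return None

    label = "positive" if has_pos else "negative"

    # Strip the markers so a downstream classifier cannot simply
    # memorise the labelling signal.
    cleaned = text
    for marker in POSITIVE_MARKERS | NEGATIVE_MARKERS:
        cleaned = cleaned.replace(marker, "")
    return label, " ".join(cleaned.split())


if __name__ == "__main__":
    for tweet in ["great day :)", "so sad 😢", "hello", "mixed :) :("]:
        print(repr(tweet), "->", distant_label(tweet))
```

Removing the marker from the text is a common design choice in distantly supervised setups, since otherwise the label can be read directly off the input.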

