The profusion of user-generated content brought about by the rise of social media platforms has enabled a surge in research in fields such as information retrieval, recommender systems, data mining and machine learning. However, the lack of comprehensive baseline data sets that allow a thorough evaluative comparison has become an important issue. In this paper we present a large data set of news items from well-known aggregators such as Google News and Yahoo! News, together with their respective social feedback on multiple platforms: Facebook, Google+ and LinkedIn. The data were collected over a period of eight months, between November 2015 and July 2016, and account for about 100,000 news items on four different topics: economy, microsoft, obama and palestine. The data set is tailored for evaluative comparisons in predictive analytics tasks, although it also supports tasks in other research areas such as topic detection and tracking, sentiment analysis in short text, first story detection and news recommendation.