Data plays a vital role in machine learning research. In recommendation research, both user behaviors and side information help to model users, so large-scale datasets from real scenarios with abundant user behaviors are highly valuable. However, such datasets are hard to obtain because most of them are held and protected by companies. In this paper, we present a new large-scale dataset collected from a knowledge-sharing platform. It comprises around 100M interactions collected within 10 days, 798K users, 165K questions, 554K answers, 240K authors, 70K topics, and more than 501K user query keywords. It also includes anonymized descriptions of users, answers, questions, authors, and topics. Notably, each user's latest query keywords, which reveal the user's explicit information needs, have not been included in previous open datasets. We characterize the dataset and demonstrate its potential applications for recommendation research. Multiple experiments show that the dataset can be used to evaluate algorithms for general top-N recommendation, sequential recommendation, and context-aware recommendation. It can also support studies that integrate search and recommendation, as well as recommendation with negative feedback. Beyond recommendation, tasks such as user gender prediction, most valuable answerer identification, and high-quality answer recognition can also use this dataset. To the best of our knowledge, this is the largest real-world interaction dataset for personalized recommendation.