A Large-Scale Rich Context Query and Recommendation Dataset in Online Knowledge-Sharing

Data plays a vital role in machine learning studies. In recommendation research, both user behaviors and side information are helpful for modeling users, so large-scale datasets from real scenarios with abundant user behaviors are highly valuable. However, such datasets are hard to obtain, as most of them are held and protected by companies. In this paper, we present a new large-scale dataset collected from a knowledge-sharing platform. It comprises around 100M interactions collected within 10 days, 798K users, 165K questions, 554K answers, 240K authors, 70K topics, and more than 501K user query keywords. It also includes anonymized descriptions of users, answers, questions, authors, and topics. Notably, the dataset includes each user's latest query keywords, which reveal users' explicit information needs and have not been included in previous open datasets. We characterize the dataset and demonstrate its potential applications for recommendation research. Multiple experiments show that the dataset can be used to evaluate algorithms for general top-N recommendation, sequential recommendation, and context-aware recommendation. It can also be used to study the integration of search and recommendation, and recommendation with negative feedback. Beyond recommendation, tasks such as user gender prediction, most valuable answerer identification, and high-quality answer recognition can also use this dataset. To the best of our knowledge, this is the largest real-world interaction dataset for personalized recommendation.

