KuaiRec: A Fully-observed Dataset for Recommender Systems

Recommender systems are usually developed and evaluated on historical user-item logs. However, most offline recommendation datasets are highly sparse and contain various biases, which hampers the evaluation of recommendation policies. Existing efforts aim to improve data quality by collecting users' preferences on randomly selected items (e.g., Yahoo! and Coat). However, they still suffer from high variance caused by the sparsely observed data. To fundamentally solve the problem, we present KuaiRec, a fully-observed dataset collected from Kuaishou, a social video-sharing mobile app. The feedback of 1,411 users on almost all of the 3,327 videos is explicitly observed. To the best of our knowledge, this is the first real-world fully-observed dataset with millions of user-item interactions in recommendation. To demonstrate the advantage of KuaiRec, we use it to explore the key questions in evaluating conversational recommender systems. The experimental results show that two factors in traditional partially-observed data -- the data density and the exposure bias -- greatly affect the evaluation results. This underscores the significance of our fully-observed data for research in many directions in recommender systems, e.g., unbiased recommendation, interactive/conversational recommendation, and evaluation. We release the dataset and the pipeline implementation for evaluation at https://chongminggao.github.io/KuaiRec/.
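To make the two factors named in the abstract concrete, the following is a minimal sketch of how a fully-observed user-item matrix could be degraded into a partially-observed one, either uniformly (lower density only) or with item-dependent exposure (lower density plus exposure bias). It is not the authors' released pipeline. The all-ones mask idealizes "almost all" as "all", and the function names, the 10% keep fraction, and the Zipf-like popularity weights are assumptions made purely for illustration.

```python
import numpy as np

# Matrix size quoted in the abstract: 1,411 users x 3,327 videos.
N_USERS, N_ITEMS = 1411, 3327


def density(observed_mask: np.ndarray) -> float:
    """Fraction of user-item cells that have an observed interaction."""
    return float(observed_mask.mean())


def subsample(mask, keep_fraction, item_weights=None, seed=0):
    """Drop observed cells to simulate a partially-observed log.

    With item_weights=None every cell is kept with probability
    keep_fraction (uniform exposure). Otherwise each item's keep
    probability is scaled by its weight and clipped to [0, 1], so
    heavily weighted items are exposed more often (exposure bias).
    Clipping means the resulting density can fall below keep_fraction.
    """
    rng = np.random.default_rng(seed)
    n_items = mask.shape[1]
    if item_weights is None:
        probs = np.full(n_items, keep_fraction)
    else:
        w = np.asarray(item_weights, dtype=float)
        probs = np.clip(keep_fraction * w / w.mean(), 0.0, 1.0)
    # probs has shape (n_items,) and broadcasts across all user rows.
    keep = rng.random(mask.shape) < probs
    return mask & keep


# Idealized fully-observed mask (assumption: every cell is observed).
full = np.ones((N_USERS, N_ITEMS), dtype=bool)

# Uniform exposure: only the density changes.
uniform = subsample(full, keep_fraction=0.1)

# Hypothetical Zipf-like popularity: early items are exposed far more.
popularity = 1.0 / np.arange(1, N_ITEMS + 1)
biased = subsample(full, keep_fraction=0.1, item_weights=popularity)

print("full density:   ", density(full))
print("uniform density:", density(uniform))
print("biased density: ", density(biased))
```

Evaluating the same recommendation policy on `uniform` and `biased` and comparing against `full` is one way to separate the effect of density from the effect of exposure bias, which is only possible when the fully-observed ground truth is available.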

