Progress in recommender systems is hampered mainly by evaluation, which requires real-time interactions between humans and systems and is therefore laborious and expensive. This issue is usually approached by using the interaction history to conduct offline evaluation. However, existing datasets of user-item interactions are only partially observed, leaving it unclear how, and to what extent, the missing interactions influence the evaluation. To answer this question, we collect a fully-observed dataset from Kuaishou's online environment, in which almost all of the 1,411 users have been exposed to all 3,327 items. To the best of our knowledge, this is the first real-world fully-observed dataset with millions of user-item interactions. With this unique dataset, we conduct a preliminary analysis of how two factors, data density and exposure bias, affect the evaluation results of multi-round conversational recommendation. Our main finding is that the performance ranking of different methods varies with these two factors, and that this effect can be alleviated only in certain cases by estimating the missing interactions for user simulation. This demonstrates the necessity of a fully-observed dataset. We release the dataset and the evaluation pipeline implementation at https://kuairec.com