Recommender models are hard to evaluate, particularly in the offline setting. In this paper, we provide a comprehensive and critical analysis of the data leakage issue in recommender system offline evaluation. Data leakage is caused by not observing the global timeline when evaluating recommenders, e.g., when the train/test data split does not follow the global timeline. As a result, a model learns from user-item interactions that would not be available at prediction time. We first show the temporal dynamics of user-item interactions along the global timeline, then explain why data leakage exists for collaborative filtering models. Through carefully designed experiments, we show that all models indeed recommend future items that are not available at the time point of a test instance, as a result of data leakage. The experiments are conducted with four widely used baseline models, BPR, NeuMF, SASRec, and LightGCN, on four popular offline datasets, MovieLens-25M, Yelp, Amazon-music, and Amazon-electronic, adopting the leave-last-one-out data split. We further show that data leakage does impact models' recommendation accuracy: their relative performance order becomes unpredictable under different amounts of leaked future data in training. To evaluate recommender systems in a realistic offline manner, we propose a timeline scheme, which calls for a revisit of recommendation model design.
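The leakage described above can be made concrete with a small sketch. The snippet below contrasts a leave-last-one-out split with a global-timeline split on a hypothetical toy interaction log (the data, function names, and cutoff date are all illustrative assumptions, not from the paper): under leave-last-one-out, a training interaction can be timestamped later than another user's test instance, i.e., future data leaks into training.

```python
from datetime import date

# Hypothetical toy interaction log: (user, item, timestamp).
interactions = [
    ("u1", "i1", date(2020, 1, 1)),
    ("u1", "i2", date(2020, 3, 1)),  # u1's test instance under leave-last-one-out
    ("u2", "i3", date(2020, 5, 1)),  # lands in training, yet is *after* u1's test
    ("u2", "i2", date(2020, 6, 1)),
]

def leave_last_one_out(log):
    """Per-user split: each user's last interaction becomes the test
    instance, the rest go to training. Ignores the global timeline."""
    train, test = [], []
    for u in {user for user, _, _ in log}:
        hist = sorted((r for r in log if r[0] == u), key=lambda r: r[2])
        train.extend(hist[:-1])
        test.append(hist[-1])
    return train, test

def global_timeline_split(log, cutoff):
    """Single global cutoff: interactions before the cutoff are training
    data, the rest are test data. No future interaction enters training."""
    train = [r for r in log if r[2] < cutoff]
    test = [r for r in log if r[2] >= cutoff]
    return train, test

train, test = leave_last_one_out(interactions)
# Some training interaction is later than some test instance -> leakage.
leaks = max(t for _, _, t in train) > min(t for _, _, t in test)

tr2, te2 = global_timeline_split(interactions, date(2020, 4, 1))
# Every training interaction precedes every test instance -> no leakage.
no_leak = max(t for _, _, t in tr2) < min(t for _, _, t in te2)
```

Here `leaks` is true for the leave-last-one-out split but the global-timeline split keeps training strictly before test, which is the core of the timeline scheme the abstract argues for.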