In academic research, recommender models are often evaluated on offline datasets. The offline dataset is first split into training and test instances. All training instances are then modelled in a user-item interaction matrix, which is used to train recommender models. Many such offline evaluations ignore the global timeline in the data, which leads to "data leakage": a model learns from future data to predict a current value, making the evaluation unrealistic. In this paper, we evaluate the impact of "data leakage" using two widely adopted baseline models, BPR and NeuMF, on four popular offline datasets: MovieLens-25M, Yelp, Amazon-music, and Amazon-electronic. We show that access to different amounts of future data may improve or deteriorate a model's recommendation accuracy. That is, ignoring the global timeline in offline evaluation makes the performance of recommendation models incomparable. We share our understanding of these observations, highlight the importance of preserving the global timeline, and call for a revisit of recommender system offline evaluation.
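To make the leakage issue concrete, below is a minimal sketch (not the paper's evaluation code) contrasting a random hold-out split, which ignores the global timeline, with a split that holds out only the most recent interactions. The pandas DataFrame layout, column names, and the leakage measure are illustrative assumptions, not taken from the paper.

```python
# Sketch: random split vs. global-timeline split of user-item interactions.
# Assumes interactions are (user, item, timestamp) rows; all names here are
# illustrative assumptions, not the paper's actual protocol.
import pandas as pd

def random_split(df, test_frac=0.2, seed=0):
    """Randomly hold out interactions, ignoring the global timeline."""
    test = df.sample(frac=test_frac, random_state=seed)
    train = df.drop(test.index)
    return train, test

def global_timeline_split(df, test_frac=0.2):
    """Hold out the most recent interactions so training never sees the future."""
    df = df.sort_values("timestamp")
    cut = int(len(df) * (1 - test_frac))
    return df.iloc[:cut], df.iloc[cut:]

def leaked_fraction(train, test):
    """Fraction of test interactions that happen before some training interaction,
    i.e. cases where the trained model has effectively seen 'future' data."""
    return (test["timestamp"] < train["timestamp"].max()).mean()

if __name__ == "__main__":
    df = pd.DataFrame({
        "user": [1, 1, 2, 2, 3, 3, 4, 4],
        "item": [10, 11, 10, 12, 11, 13, 12, 13],
        "timestamp": [1, 2, 3, 4, 5, 6, 7, 8],
    })
    tr_rand, te_rand = random_split(df)
    tr_time, te_time = global_timeline_split(df)
    print("leakage with random split:  ", leaked_fraction(tr_rand, te_rand))
    print("leakage with timeline split:", leaked_fraction(tr_time, te_time))
```

With the random split, a sizeable fraction of test interactions predate training interactions, so the trained matrix encodes future behaviour; the timeline-preserving split drives that fraction to zero by construction.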