Various studies in recent years have pointed out significant issues in the offline evaluation of recommender systems, making it difficult to assess whether true progress has been made. However, there has been little research into what set of practices should serve as a starting point during experimentation. In this paper, we examine in more detail four larger issues in recommender system research, regarding uncertainty estimation, generalization, hyperparameter optimization, and dataset pre-processing, to arrive at a set of guidelines. We present TrainRec, a lightweight and flexible toolkit for the offline training and evaluation of recommender systems that implements these guidelines. Unlike other frameworks, TrainRec focuses on experimentation alone, offering flexible modules that can be used together or in isolation. Finally, we demonstrate TrainRec's usefulness by evaluating a diverse set of twelve baselines across ten datasets. Our results show that (i) many results on smaller datasets are likely not statistically significant, (ii) at least three baselines perform well on most datasets and should be considered in future experiments, and (iii) improved uncertainty quantification (via nested CV and statistical testing) rules out some previously reported differences between linear and neural methods. Given these results, we advocate that future research should standardize evaluation using our suggested guidelines.