We study the practical consequences of dataset sampling strategies on the ranking performance of recommendation algorithms. Recommender systems are generally trained and evaluated on samples of larger datasets. Samples are often taken in a naive or ad hoc fashion, e.g., by sampling a dataset randomly or by selecting users or items with many interactions. As we demonstrate, commonly used data sampling schemes can have significant consequences for algorithm performance. Following this observation, this paper makes three main contributions: (1) characterizing the effect of sampling on algorithm performance in terms of algorithm and dataset characteristics (e.g., sparsity characteristics, sequential dynamics); (2) designing SVP-CF, a data-specific sampling strategy that aims to preserve the relative performance of models after sampling and is especially suited to long-tailed interaction data; and (3) developing an oracle, Data-Genie, which can suggest the sampling scheme that is most likely to preserve model performance for a given dataset. The main benefit of Data-Genie is that it allows recommender system practitioners to quickly prototype and compare various approaches while remaining confident that algorithm performance will be preserved once the algorithm is retrained and deployed on the complete data. Detailed experiments show that, using Data-Genie, we can discard up to 5x more data than any sampling strategy while retaining the same level of performance.
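To make the two naive schemes mentioned above concrete, the following minimal Python sketch samples a toy interaction log either uniformly at random or by keeping only the most active users. The function names, the `(user, item)` tuple representation, and the toy data are illustrative assumptions; this is not SVP-CF or Data-Genie.

```python
import random
from collections import Counter


def random_interaction_sample(interactions, fraction, seed=0):
    """Keep a uniformly random subset of the (user, item) interactions."""
    rng = random.Random(seed)
    k = round(len(interactions) * fraction)
    return rng.sample(interactions, k)


def head_user_sample(interactions, fraction):
    """Keep every interaction of the most active users.

    Users are ranked by interaction count (ties broken by user id),
    and the top `fraction` of users is retained.
    """
    counts = Counter(user for user, _ in interactions)
    ranked = sorted(counts, key=lambda u: (-counts[u], u))
    n_keep = max(1, round(len(ranked) * fraction))
    kept = set(ranked[:n_keep])
    return [(u, i) for u, i in interactions if u in kept]


# Hypothetical toy log: u1 is a heavy user, u3 and u4 are in the long tail.
interactions = [
    ("u1", "a"), ("u1", "b"), ("u1", "c"), ("u1", "d"),
    ("u2", "a"), ("u2", "b"),
    ("u3", "a"),
    ("u4", "c"),
]

random_subset = random_interaction_sample(interactions, 0.5)
head_subset = head_user_sample(interactions, 0.5)
```

In this toy log, the head-user scheme drops the long-tail users `u3` and `u4` entirely, whereas random interaction sampling thins all users roughly proportionally. The two schemes therefore produce samples with different sparsity characteristics, which is the kind of difference the study examines.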