We study the practical consequences of dataset sampling strategies on the performance of recommendation algorithms. Recommender systems are typically trained and evaluated on samples of much larger datasets, and these samples are often drawn in a naive or ad hoc fashion: e.g., by sampling interactions uniformly at random, or by keeping only the users or items with the most interactions. As we demonstrate, such commonly used sampling schemes can have significant consequences for algorithm performance -- masking performance deficiencies in an algorithm, or altering the relative ranking of algorithms compared to models trained on the complete dataset. Motivated by this observation, this paper makes two main contributions: (1) characterizing the effect of sampling on algorithm performance in terms of algorithm and dataset characteristics (e.g., sparsity, sequential dynamics); and (2) designing SVP-CF, a data-specific sampling strategy that aims to preserve the relative performance of models after sampling and is especially suited to long-tail interaction data. Detailed experiments show that SVP-CF retains the relative ranking of different recommendation algorithms more accurately than commonly used sampling schemes.
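The two naive schemes mentioned above can be sketched as follows. This is a minimal illustration on a toy interaction log, not the paper's implementation; all function and variable names are ours:

```python
import random
from collections import Counter

# Toy interaction log of (user, item) pairs; users have varying activity
# levels so that a "head" of highly active users exists.
interactions = [
    (u, i)
    for u in range(100)
    for i in random.Random(u).sample(range(50), k=1 + u % 10)
]

def sample_random(data, frac, seed=0):
    """Naive scheme 1: keep a uniform random fraction of interactions."""
    rng = random.Random(seed)
    return [pair for pair in data if rng.random() < frac]

def sample_head_users(data, frac):
    """Naive scheme 2: keep only the most active users (the 'head'),
    discarding the long tail entirely."""
    counts = Counter(u for u, _ in data)
    n_keep = max(1, int(len(counts) * frac))
    head = {u for u, _ in counts.most_common(n_keep)}
    return [(u, i) for u, i in data if u in head]

random_sample = sample_random(interactions, frac=0.5)
head_sample = sample_head_users(interactions, frac=0.5)
```

Both schemes shrink the dataset, but they distort it differently: random sampling thins out every user's history, while head-user sampling removes the long tail of sparse users altogether -- precisely the regime where, as the paper argues, relative algorithm performance can change.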