Collaborative filtering models based on matrix factorization and on similarities learned with Artificial Neural Networks (ANNs) have gained significant attention in recent years. This is, in part, because ANNs have demonstrated good results in a wide variety of recommendation tasks. The introduction of ANNs within the recommendation ecosystem has recently been questioned, prompting several comparisons in terms of efficiency and effectiveness. Most of these comparisons share a focus on accuracy and neglect other evaluation dimensions that matter for recommendation, such as novelty, diversity, or accounting for biases. We replicate experiments from three papers that compare Neural Collaborative Filtering (NCF) and Matrix Factorization (MF), and extend the analysis to these other evaluation dimensions. We show that the experiments are entirely reproducible, and we extend the study with additional accuracy metrics and two statistical hypothesis tests. We investigate the diversity and novelty of the recommendations, showing that MF is more accurate on the long tail as well, although NCF provides better item coverage and more diversified recommendations. We also discuss the bias introduced by the tested methods: they show a relatively small bias, but other recommendation baselines with competitive accuracy are consistently less affected by this issue. To the best of our knowledge, this is the first work in which several evaluation dimensions are explored for an array of state-of-the-art (SOTA) algorithms covering recent adaptations of ANNs and MF. Hence, we show the potential of these techniques in beyond-accuracy evaluation while analyzing the effect these complementary dimensions may have on reproducibility. The repository is available at github.com/sisinflab/Reenvisioning-the-comparison-between-Neural-Collaborative-Filtering-and-Matrix-Factorization
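As a rough illustration of two notions the abstract relies on, the sketch below scores items with a plain MF dot product and then measures item coverage over the resulting top-k lists. It is not taken from the paper's repository: the function names, the toy factor matrices, and the choice to define item coverage as the fraction of the catalog appearing in at least one list are all assumptions made for this example.

```python
import numpy as np

def mf_scores(user_factors, item_factors):
    """Score every (user, item) pair with the MF dot product."""
    return user_factors @ item_factors.T

def top_k(scores, k):
    """Return the indices of the k highest-scoring items per user, best first."""
    return np.argsort(-scores, axis=1)[:, :k]

def item_coverage(recommendations, n_items):
    """Fraction of the catalog that appears in at least one top-k list
    (an assumed definition; some works report the raw count instead)."""
    return len(np.unique(recommendations)) / n_items

# Hypothetical toy data: 3 users, 4 items, 2 latent factors.
P = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Q = np.array([[0.9, 0.1], [0.2, 0.7], [0.4, 0.4], [0.1, 0.1]])

recs = top_k(mf_scores(P, Q), k=2)
coverage = item_coverage(recs, n_items=Q.shape[0])
print(recs.tolist(), coverage)
```

In this toy setting the last item is never recommended, so coverage falls below 1 even though every user receives a full list. This is the kind of effect a purely accuracy-based comparison would not surface.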