Recommendation systems are often evaluated on users' interactions collected from an existing, already deployed recommendation system. In this situation, users provide feedback only on the items they were exposed to; they may leave no feedback on other items because the deployed system never showed them. As a result, the feedback dataset used to evaluate a new model is influenced by the deployed system, a form of closed loop feedback. In this paper, we show that the typical offline evaluation of recommender systems suffers from the so-called Simpson's paradox. Simpson's paradox is the name given to a phenomenon observed when a significant trend appears in several different sub-populations of observational data but disappears or is even reversed when these sub-populations are combined. Our in-depth experiments based on stratified sampling reveal that a very small minority of items that are frequently exposed by the deployed system acts as a confounding factor in the offline evaluation of recommendation systems. In addition, we propose a novel evaluation methodology that takes into account the confounder, i.e., the deployed system's characteristics. Using the relative comparison of many recommendation models, as in the typical offline evaluation of recommender systems, and based on the Kendall rank correlation coefficient, we show that our proposed evaluation methodology exhibits statistically significant improvements of 14% and 40% on the examined open loop datasets (Yahoo! and Coat), respectively, over the standard evaluation in reflecting the true ranking of systems obtained with an open loop (randomised) evaluation.
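To make the two statistical notions in the abstract concrete, the following is a minimal sketch and not the paper's experimental setup. The hit counts, the model names, and the stratum labels (`frequently_exposed`, `rarely_exposed`) are all hypothetical values chosen only so that the reversal appears. The sketch shows one model winning inside every stratum while losing once the strata are pooled, and includes a small tie-free Kendall tau-a helper of the kind that could be used to compare two rankings of models.

```python
from itertools import combinations

# Hypothetical (hits, trials) per stratum for two models. Illustrative only,
# not taken from the paper or from any dataset.
COUNTS = {
    "model_a": {"frequently_exposed": (192, 263), "rarely_exposed": (81, 87)},
    "model_b": {"frequently_exposed": (55, 80), "rarely_exposed": (234, 270)},
}


def per_stratum_rate(model):
    """Hit rate of `model` computed separately inside each stratum."""
    return {name: hits / trials for name, (hits, trials) in COUNTS[model].items()}


def pooled_rate(model):
    """Hit rate of `model` after merging all strata into one population."""
    hits = sum(h for h, _ in COUNTS[model].values())
    trials = sum(t for _, t in COUNTS[model].values())
    return hits / trials


def kendall_tau(x, y):
    """Kendall tau-a between two equal-length score lists (assumes no ties)."""
    concordant = discordant = 0
    for i, j in combinations(range(len(x)), 2):
        product = (x[i] - x[j]) * (y[i] - y[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) / 2
    return (concordant - discordant) / n_pairs


if __name__ == "__main__":
    for model in COUNTS:
        print(model, per_stratum_rate(model), round(pooled_rate(model), 3))
```

With these toy counts, `model_a` has the higher hit rate in both strata, yet `model_b` has the higher pooled rate, because the two models draw very different shares of their trials from each stratum. That imbalance across strata is what allows the aggregate trend to reverse.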