Offline evaluation is a popular approach to determining the best algorithm according to a chosen quality metric. However, if the chosen metric computes something other than what was intended, this mismatch can lead to poor decisions and wrong conclusions. In this paper, we thoroughly investigate the quality metrics used for recommender system evaluation. We examine both the practical aspect of implementations found in modern RecSys libraries and the theoretical aspect of definitions in academic papers. We find that Precision is the only metric universally understood across papers and libraries, while other metrics admit different interpretations. Metrics implemented in different libraries sometimes share the same name but measure different things, which leads to different results given the same input. When defining metrics in an academic paper, authors sometimes omit explicit formulations or cite references that do not contain an explanation either. In 47% of cases, we cannot easily determine how a metric is defined because the definition is unclear or absent. These findings highlight yet another difficulty in recommender system evaluation and call for more detailed descriptions of evaluation protocols.
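To make the "same name, different measure" finding concrete, below is a minimal sketch (our own illustration, not code from the paper or from any particular library) of two definitions of Recall@k that both circulate under the same name in practice: one divides the top-k hits by the total number of relevant items, the other caps the denominator at k. The function names and example data are hypothetical.

    # Two plausible definitions of Recall@k that share a name but
    # produce different results on the same input (hypothetical names/data).

    def recall_at_k_v1(recommended, relevant, k):
        # Hits in the top-k divided by the total number of relevant items.
        hits = len(set(recommended[:k]) & set(relevant))
        return hits / len(relevant)

    def recall_at_k_v2(recommended, relevant, k):
        # Same numerator, but the denominator is capped at k, so a perfect
        # top-k list always scores 1 even when there are more than k
        # relevant items.
        hits = len(set(recommended[:k]) & set(relevant))
        return hits / min(k, len(relevant))

    recommended = ["a", "b", "c", "d"]
    relevant = {"a", "d", "e", "f", "g"}

    print(recall_at_k_v1(recommended, relevant, k=3))  # 1/5 = 0.2
    print(recall_at_k_v2(recommended, relevant, k=3))  # 1/3 ~= 0.333

Given identical recommendations and ground truth, the two functions report 0.2 and 0.333 respectively, which is exactly the kind of silent divergence the paper warns about.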