Researchers use recall to evaluate rankings across a variety of retrieval, recommendation, and machine learning tasks. While there is a colloquial interpretation of recall in set-based evaluation, the research community is far from a principled understanding of recall metrics for rankings. This lack of a principled understanding of, or motivation for, recall has led to criticism within the retrieval community questioning whether recall is useful as a measure at all. In this light, we reflect on the measurement of recall in rankings from a formal perspective. Our analysis is composed of three tenets: recall, robustness, and lexicographic evaluation. First, we formally define "recall orientation" as sensitivity to movement of the bottom-ranked relevant item. Second, we analyze our concept of recall orientation from the perspective of robustness with respect to possible searchers and content providers. Finally, we extend this conceptual and theoretical treatment of recall by developing a practical preference-based evaluation method based on lexicographic comparison. Through extensive empirical analysis across 17 TREC tracks, we establish that our new evaluation method, lexirecall, is correlated with existing recall metrics and exhibits substantially higher discriminative power and stability in the presence of missing labels. Our conceptual, theoretical, and empirical analysis substantially deepens our understanding of recall and motivates its adoption through connections to robustness and fairness.
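One possible reading of the lexicographic comparison described above is sketched below. This is an illustration under stated assumptions, not the authors' implementation: the function names `relevant_positions` and `lexi_compare` are hypothetical, relevance is assumed binary, rankings are assumed free of duplicates, and relevant items missing from a ranking are assumed to sit just past its end. Two rankings are compared by the position of their bottom-ranked relevant item, with ties broken by the next relevant item up, and so on.

```python
from typing import Sequence, Set


def relevant_positions(ranking: Sequence[str], relevant: Set[str]) -> list:
    """Return the sorted 1-based ranks of the relevant items in `ranking`.

    Assumption (not from the abstract): each relevant item that was not
    retrieved is treated as if ranked just below the end of the list.
    """
    pos = [i for i, doc in enumerate(ranking, start=1) if doc in relevant]
    missing = len(relevant) - len(pos)
    pos.extend([len(ranking) + 1] * missing)
    return sorted(pos)


def lexi_compare(a: Sequence[str], b: Sequence[str], relevant: Set[str]) -> int:
    """Return 1 if ranking `a` is preferred, -1 if `b` is, 0 if tied.

    Walks upward from the bottom-ranked relevant item; the first position
    that differs decides, and the smaller (higher-ranked) position wins.
    """
    pa = relevant_positions(a, relevant)
    pb = relevant_positions(b, relevant)
    for x, y in zip(reversed(pa), reversed(pb)):
        if x != y:
            return 1 if x < y else -1
    return 0


relevant = {"d1", "d2", "d3"}
a = ["d1", "x", "d2", "d3", "y"]   # relevant items at ranks 1, 3, 4
b = ["d2", "d3", "x", "y", "d1"]   # relevant items at ranks 1, 2, 5
c = ["x", "d1", "d2", "d3", "y"]   # relevant items at ranks 2, 3, 4

print(lexi_compare(a, b, relevant))  # a's last relevant item is higher (4 vs 5)
print(lexi_compare(a, c, relevant))  # tie at 4 and 3, decided by 1 vs 2
```

In this sketch `a` is preferred over `b` even though `b` places two relevant items in the top two ranks, because the comparison starts from the bottom-ranked relevant item. That mirrors the abstract's definition of recall orientation as sensitivity to movement of that item.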