Assessors make preference judgments faster and more consistently than graded judgments. Preference judgments can also recognize distinctions between items that appear equivalent under graded judgments. Unfortunately, preference judgments can require more than linear effort to fully order a pool of items, and evaluation measures for preference judgments are not as well established as those for graded judgments, such as NDCG. In this paper, we explore the assessment process for partial preference judgments, with the aim of identifying and ordering the top items in the pool, rather than fully ordering the entire pool. To measure the performance of a ranker, we compare its output to this preferred ordering by applying a rank similarity measure. We demonstrate the practical feasibility of this approach by crowdsourcing partial preferences for the TREC 2019 Conversational Assistance Track, replacing NDCG with a new measure named "compatibility". This new measure has its most striking impact when comparing modern neural rankers, where it is able to recognize significant improvements in quality that would otherwise be missed by NDCG.
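The idea of scoring a ranker by its similarity to a preferred ordering can be illustrated with a small sketch. The abstract does not say which rank similarity measure is applied, so the code below assumes a truncated rank-biased overlap (RBO), normalized by the score the preferred ordering obtains against itself. The function names `truncated_rbo` and `normalized_similarity`, the persistence parameter `p`, and the evaluation `depth` are illustrative choices, not taken from the paper.

```python
from typing import Hashable, Sequence


def truncated_rbo(run: Sequence[Hashable], ideal: Sequence[Hashable],
                  p: float = 0.9, depth: int = 10) -> float:
    """Truncated rank-biased overlap between two rankings (no extrapolation).

    Assumption: RBO is used here only as one example of a rank similarity
    measure; the source text does not name the measure it applies.
    """
    if not 0.0 < p < 1.0:
        raise ValueError("p must be strictly between 0 and 1")
    seen_run, seen_ideal = set(), set()
    score = 0.0
    weight = 1.0 - p  # weight at depth d is (1 - p) * p**(d - 1)
    for d in range(1, depth + 1):
        # Grow each prefix by one item, if the ranking is long enough.
        if d <= len(run):
            seen_run.add(run[d - 1])
        if d <= len(ideal):
            seen_ideal.add(ideal[d - 1])
        # Fraction of the top-d prefixes that the two rankings share.
        agreement = len(seen_run & seen_ideal) / d
        score += weight * agreement
        weight *= p
    return score


def normalized_similarity(run: Sequence[Hashable], ideal: Sequence[Hashable],
                          p: float = 0.9, depth: int = 10) -> float:
    """Scale the score so that the preferred ordering itself scores 1.0."""
    best = truncated_rbo(ideal, ideal, p, depth)
    return truncated_rbo(run, ideal, p, depth) / best if best > 0 else 0.0


# Hypothetical document identifiers, for illustration only.
ideal = ["d1", "d2", "d3"]      # preferred ordering of the top items
run_close = ["d1", "d3", "d2"]  # top item correct, lower items swapped
run_far = ["d3", "d2", "d1"]    # preferred ordering reversed
run_miss = ["x", "y", "z"]      # none of the preferred items retrieved

for name, run in [("close", run_close), ("far", run_far), ("miss", run_miss)]:
    print(name, round(normalized_similarity(run, ideal), 4))
```

Because the weights decay geometrically with rank, agreement near the top of the ranking counts most, which fits the goal of ordering only the top items in the pool rather than the entire pool.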