Preference judgments have been shown to be a better alternative to graded judgments for assessing the relevance of documents to queries. Existing work has verified that preference judgments collected from trained judges are transitive, which dramatically reduces the number of judgments required. Moreover, both strict and weak preference judgments are widely used in the literature; the latter additionally allow judges to state that two documents are equally relevant for a given query. However, it remains unclear whether transitivity still holds when judgments are collected through crowdsourcing, and whether the two kinds of preference judgments behave similarly. In this work, we collect judgments from multiple judges on a crowdsourcing platform and aggregate them to compare the two kinds of preference judgments in terms of transitivity, time consumption, and quality. That is, we examine whether the aggregated judgments are transitive, how long judges take to make them, and whether judges agree with each other and with judgments from TREC. Our key finding is that only strict preference judgments are transitive; weak preference judgments behave differently in terms of transitivity, time consumption, and judgment quality.
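To make the notion of transitivity over aggregated judgments concrete, the following is a minimal Python sketch. It assumes majority voting as the aggregation rule and a simple triple-based check; neither is stated in the abstract, and the function names (`aggregate`, `relation`, `transitivity_ratio`) and the label encoding (`">"`, `"<"`, `"="`) are hypothetical choices made only for illustration.

```python
from collections import Counter
from itertools import permutations

FLIP = {">": "<", "<": ">", "=": "="}

def aggregate(votes):
    """Majority vote per document pair (an assumed aggregation rule).

    votes maps (doc_a, doc_b) to a list of labels: '>' (a preferred),
    '<' (b preferred), or '=' (tie, only used for weak preferences).
    Pairs without a unique majority label are left out.
    """
    agg = {}
    for pair, labels in votes.items():
        top = Counter(labels).most_common(2)
        if len(top) == 2 and top[0][1] == top[1][1]:
            continue  # no unique majority, leave the pair unresolved
        agg[pair] = top[0][0]
    return agg

def relation(agg, a, b):
    """Return the aggregated label for (a, b), flipping it if stored as (b, a)."""
    if (a, b) in agg:
        return agg[(a, b)]
    if (b, a) in agg:
        return FLIP[agg[(b, a)]]
    return None

def transitivity_ratio(agg, docs):
    """Fraction of judged triples (a, b, c) that satisfy transitivity.

    If a >= b and b >= c, then a = c is expected when both premises are
    ties, and a > c otherwise. Returns None if no triple can be checked.
    """
    checked = satisfied = 0
    for a, b, c in permutations(docs, 3):
        r_ab, r_bc, r_ac = relation(agg, a, b), relation(agg, b, c), relation(agg, a, c)
        if r_ab not in (">", "=") or r_bc not in (">", "=") or r_ac is None:
            continue
        expected = "=" if r_ab == r_bc == "=" else ">"
        checked += 1
        satisfied += r_ac == expected
    return satisfied / checked if checked else None

# Hypothetical votes from three crowd judges per document pair.
votes = {
    ("d1", "d2"): [">", ">", "<"],
    ("d2", "d3"): [">", ">", "="],
    ("d1", "d3"): [">", ">", ">"],
}
ratio = transitivity_ratio(aggregate(votes), ["d1", "d2", "d3"])
```

With strict preferences the `"="` label never occurs, so the check reduces to the usual rule that a > b and b > c imply a > c. A ratio below 1 indicates that some aggregated triples form a cycle or otherwise contradict the expected ordering.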