Relevance judgments by human assessors are inherently subjective and dynamic when evaluation datasets are created for Information Retrieval (IR) systems. Nevertheless, the relevance judgments of a small group of experts are usually taken as ground truth to "objectively" evaluate the performance of IR systems. A recent trend is to employ a larger group of judges, for example through outsourcing, to alleviate the potentially biased judgments that stem from relying on a single expert. However, different judges may hold different opinions and disagree with one another, and this inconsistency in human relevance judgment may affect the results of IR system evaluation. In this research, we introduce a Relevance Judgment Convergence Degree (RJCD) to measure the quality of queries in evaluation datasets. Experimental results reveal a strong correlation between the proposed RJCD score and the performance differences between two IR systems.
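The abstract does not give the formal definition of RJCD, so the sketch below is only a minimal illustration of the evaluation idea, assuming a hypothetical stand-in convergence measure (mean pairwise assessor agreement per query) and toy per-query effectiveness scores for two systems; the function name `convergence_score`, the data, and the agreement formula are assumptions, not the paper's actual method.

```python
import numpy as np
from itertools import combinations

def convergence_score(judgments):
    """Hypothetical stand-in for a per-query convergence measure:
    mean pairwise fraction of documents on which two assessors agree.
    (The paper's actual RJCD definition is not given in this abstract.)"""
    pairs = list(combinations(judgments, 2))
    return np.mean([np.mean(np.array(a) == np.array(b)) for a, b in pairs])

# Toy data: per-query binary relevance labels from three assessors,
# and per-query effectiveness scores (e.g., AP) of two IR systems.
queries = {
    "q1": [[1, 1, 0, 0], [1, 1, 0, 0], [1, 1, 0, 1]],
    "q2": [[1, 0, 0, 1], [0, 1, 1, 0], [1, 1, 0, 0]],
    "q3": [[0, 0, 1, 1], [0, 0, 1, 1], [0, 0, 1, 1]],
}
sys_a = {"q1": 0.62, "q2": 0.40, "q3": 0.75}
sys_b = {"q1": 0.55, "q2": 0.43, "q3": 0.58}

scores = np.array([convergence_score(j) for j in queries.values()])
diffs = np.array([abs(sys_a[q] - sys_b[q]) for q in queries])

# Pearson correlation between per-query convergence and the
# per-query performance gap between the two systems.
r = np.corrcoef(scores, diffs)[0, 1]
print(f"correlation between convergence and |delta AP|: {r:.3f}")
```

A high correlation in such a setup would suggest, as the abstract reports for RJCD, that queries on which assessors converge more strongly are also the queries on which the two systems' measured performance differs more reliably.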