Inter-rater reliability (IRR) is a commonly used tool for assessing the quality of ratings from multiple raters, as it is easily obtainable from the observed ratings themselves. However, applicant selection procedures based on ratings from multiple raters usually result in a binary outcome: the applicant is either selected or not. This final outcome is not considered in IRR, which instead focuses on the ratings of the individual subjects or objects. In this work, we outline the connection between the ratings' measurement model (used for IRR) and a binary classification framework. We develop a quantile approximation that allows us to estimate the probability of correctly selecting the best applicants and to compute the error probabilities of the selection procedure (i.e., the false-positive and false-negative rates) under the assumption of the ratings' validity. If the ratings are not completely valid, the computed error probabilities correspond to a lower bound on the true error probabilities. We draw connections between inter-rater reliability and binary classification metrics, showing that the binary classification metrics depend solely on the IRR coefficient and the proportion of selected applicants. We assess the performance of the quantile approximation in a simulation study and apply it in an example comparing the reliability of multiple grant peer review selection procedures.
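The following is a minimal Monte Carlo sketch of the idea that the selection procedure's error rates are governed by the IRR coefficient and the proportion of selected applicants. It assumes a simple illustrative latent-trait measurement model (true quality plus rating noise scaled so that the reliability equals a chosen IRR value); it is not the paper's quantile approximation, and the function and parameter names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(2024)

def selection_error_rates(irr, prop_selected, n_applicants=100_000):
    """Illustrative sketch: simulate a selection based on noisy ratings.

    Assumed model: true quality T ~ N(0, 1), observed composite rating
    R = T + E with E ~ N(0, (1 - irr) / irr), so Var(T) / Var(R) = irr
    plays the role of the IRR coefficient.
    """
    n_select = int(round(prop_selected * n_applicants))
    true_quality = rng.standard_normal(n_applicants)
    noise_sd = np.sqrt((1 - irr) / irr)
    observed = true_quality + noise_sd * rng.standard_normal(n_applicants)

    # Applicants who are truly among the best vs. those actually selected
    # on the basis of the observed (noisy) ratings.
    truly_best = np.argsort(true_quality)[-n_select:]
    selected = np.argsort(observed)[-n_select:]

    true_positive = len(np.intersect1d(truly_best, selected))
    false_positive = n_select - true_positive   # selected but not among the best
    false_negative = n_select - true_positive   # among the best but not selected

    fpr = false_positive / (n_applicants - n_select)  # false-positive rate
    fnr = false_negative / n_select                   # false-negative rate
    return fpr, fnr

# Error rates shift with the IRR coefficient for a fixed selection proportion.
for irr in (0.4, 0.6, 0.8):
    fpr, fnr = selection_error_rates(irr=irr, prop_selected=0.2)
    print(f"IRR={irr:.1f}: false-positive rate={fpr:.3f}, false-negative rate={fnr:.3f}")
```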