Real-world data for classification is often labeled by multiple annotators. For analyzing such data, we introduce CROWDLAB, a straightforward approach to estimate: (1) a consensus label for each example that aggregates the individual annotations (more accurately than aggregation via majority vote or other algorithms used in crowdsourcing); (2) a confidence score for how likely each consensus label is correct (via well-calibrated estimates that account for the number of annotations per example and their agreement, the prediction confidence of a trained classifier, and the trustworthiness of each annotator relative to the classifier); (3) a rating for each annotator quantifying the overall correctness of their labels. While many algorithms have been proposed to estimate related quantities in crowdsourcing, these often rely on sophisticated generative models with iterative inference schemes, whereas CROWDLAB is based on simple weighted ensembling. Many algorithms also rely solely on annotator statistics, ignoring the features of the examples from which the annotations derive. CROWDLAB in contrast utilizes any classifier model trained on these features, which can generalize between examples with similar features. In evaluations on real-world multi-annotator image data, our proposed method provides more accurate estimates for (1)-(3) than many alternative algorithms.
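Because CROWDLAB is built on simple weighted ensembling rather than iterative generative-model inference, the core idea can be sketched in a few lines. The following is an illustrative sketch only, not the authors' exact estimator: the function and variable names (`ensemble_consensus`, `annotator_labels`, etc.) are ours, and the trustworthiness weights here (agreement with an initial majority-vote consensus) are a simplified stand-in for the calibrated weights used by CROWDLAB itself.

```python
import numpy as np

def ensemble_consensus(annotator_labels, pred_probs):
    """Simplified CROWDLAB-style weighted ensembling (illustrative sketch).

    annotator_labels : (n_examples, n_annotators) int array; -1 marks a
                       missing annotation.
    pred_probs       : (n_examples, n_classes) classifier predicted
                       probabilities for each example.
    Returns consensus labels (1), confidence scores (2), and per-annotator
    quality ratings (3).
    """
    n, m = annotator_labels.shape
    k = pred_probs.shape[1]

    # Initial consensus via majority vote; fall back to the classifier
    # when an example has no annotations.
    majority = np.array([
        np.bincount(row[row >= 0], minlength=k).argmax() if (row >= 0).any()
        else pred_probs[i].argmax()
        for i, row in enumerate(annotator_labels)
    ])

    # Trustworthiness weights: how often the classifier / each annotator
    # agrees with the initial consensus.
    model_weight = (pred_probs.argmax(axis=1) == majority).mean()
    annot_weight = np.array([
        (annotator_labels[:, j][annotator_labels[:, j] >= 0]
         == majority[annotator_labels[:, j] >= 0]).mean()
        if (annotator_labels[:, j] >= 0).any() else 0.0
        for j in range(m)
    ])

    # Weighted ensemble of classifier probabilities and one-hot annotations.
    scores = model_weight * pred_probs.copy()
    total = np.full(n, model_weight)
    for j in range(m):
        labeled = annotator_labels[:, j] >= 0
        scores[labeled, annotator_labels[labeled, j]] += annot_weight[j]
        total[labeled] += annot_weight[j]
    probs = scores / total[:, None]

    consensus = probs.argmax(axis=1)    # (1) consensus label per example
    confidence = probs.max(axis=1)      # (2) confidence the consensus is correct
    return consensus, confidence, annot_weight  # (3) annotator ratings
```

Note how the classifier enters the ensemble exactly like an extra annotator with its own trustworthiness weight: examples with few or conflicting annotations lean more on `pred_probs`, which is how feature information generalizes between similar examples.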