The task of aggregating and denoising crowd-labeled data has grown in importance with the advent of crowdsourcing platforms and massive datasets. We propose a permutation-based model for crowd-labeled data that significantly generalizes the classical Dawid-Skene model, and we introduce a new error metric for comparing estimators. We derive global minimax rates for the permutation-based model that are sharp up to logarithmic factors and that match the minimax lower bounds derived under the simpler Dawid-Skene model. We then design two computationally efficient estimators: the WAN estimator, for the setting where the ordering of workers by ability is approximately known, and the OBI-WAN estimator, for the setting where it is not. For each estimator, we provide non-asymptotic bounds on its performance. We conduct synthetic simulations and experiments on real-world crowdsourcing data, and the results corroborate our theoretical findings.