Crowdsourcing has emerged as an alternative solution for collecting labels at large scale. However, most recruited workers are not domain experts, so the labels they contribute can be noisy. In this paper, we propose a two-stage model to predict the true labels for multicategory classification tasks in crowdsourcing. In the first stage, we fit the observed labels with a latent factor model and incorporate subgroup structures for both tasks and workers through a multi-centroid grouping penalty. We introduce group-specific rotations to align workers with the different task categories, which makes the model applicable to multicategory crowdsourcing tasks. In the second stage, we propose a concordance-based approach to identify high-quality worker subgroups, which are then relied upon to assign labels to tasks. Theoretically, we establish the estimation consistency of the latent factors and the prediction consistency of the proposed method. Simulation studies show that the proposed method outperforms existing competitive methods when subgroup structures are present within tasks and workers. We also apply the proposed method to real-world problems and demonstrate its superior performance.
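To make the second-stage idea concrete, the following is a minimal sketch of concordance-based worker selection on a toy label matrix. It is not the proposed method: the latent factor model, the multi-centroid grouping penalty, and the group-specific rotations are all omitted, a plain majority vote stands in for the first stage, and individual workers stand in for worker subgroups. The function names, the toy data, and the 0.6 threshold are illustrative assumptions only.

```python
from collections import Counter


def majority_vote(labels, workers=None):
    """Label each task by plurality vote among the given workers.

    labels[i][j] is the label worker j assigned to task i.
    Ties are broken by choosing the smallest label (an arbitrary convention).
    """
    if workers is None:
        workers = range(len(labels[0]))
    workers = list(workers)
    voted = []
    for row in labels:
        counts = Counter(row[j] for j in workers)
        top = max(counts.values())
        voted.append(min(lab for lab, c in counts.items() if c == top))
    return voted


def concordance(labels, reference):
    """Fraction of tasks on which each worker agrees with the reference labels."""
    n_tasks = len(labels)
    n_workers = len(labels[0])
    return [
        sum(labels[i][j] == reference[i] for i in range(n_tasks)) / n_tasks
        for j in range(n_workers)
    ]


def two_pass_labels(labels, threshold=0.6):
    """Stand-in for the two stages: preliminary labels, then trusted workers only."""
    preliminary = majority_vote(labels)          # crude substitute for stage one
    scores = concordance(labels, preliminary)    # agreement with preliminary labels
    trusted = [j for j, s in enumerate(scores) if s >= threshold]
    if not trusted:                              # fall back to everyone
        trusted = list(range(len(scores)))
    return majority_vote(labels, trusted), trusted, preliminary


# Toy data (hypothetical): 6 tasks, 3 categories, 5 workers.
# Workers 0-2 are mostly accurate; workers 3-4 are noisy.
labels = [
    [0, 0, 0, 1, 2],
    [1, 1, 1, 2, 0],
    [2, 2, 2, 0, 0],
    [0, 0, 0, 0, 1],
    [1, 1, 2, 0, 0],
    [2, 1, 2, 1, 1],
]

final, trusted, preliminary = two_pass_labels(labels)
print("preliminary:", preliminary)
print("trusted workers:", trusted)
print("final:", final)
```

In this toy example the noisy workers tip the preliminary vote on the last two tasks, and restricting the vote to workers with high concordance changes those two labels. The paper's approach differs in that it identifies high-quality subgroups from the fitted latent factors rather than from a raw majority vote.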