Crowdsourcing provides an efficient label collection scheme for supervised machine learning. However, to control annotation cost, each instance in crowdsourced data is typically annotated by only a small number of annotators. This creates a sparsity issue and limits the quality of machine learning models trained on such data. In this paper, we study how to handle sparsity in crowdsourced data using data augmentation. Specifically, we propose to learn a classifier directly by augmenting the raw sparse annotations. We implement two principles of high-quality augmentation using Generative Adversarial Networks: 1) the generated annotations should follow the distribution of authentic ones, as measured by a discriminator; 2) the generated annotations should have high mutual information with the ground-truth labels, as measured by an auxiliary network. Extensive experiments on three real-world datasets, with comparisons against an array of state-of-the-art learning-from-crowds methods, demonstrate the effectiveness of our data augmentation framework. The results also suggest the potential of our algorithm for low-budget crowdsourcing in general.
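As a rough illustration of how the two principles could be combined into a single training objective, the sketch below uses an InfoGAN-style formulation. This is an assumption for exposition, not the objective stated in this abstract: the symbols $G$ (annotation generator), $D$ (discriminator), $Q$ (auxiliary network), $a$ (an authentic annotation), $\tilde{a}$ (a generated annotation), $y$ (the label variable), and the weight $\lambda$ are all hypothetical names introduced here.

```latex
% Principle 1: generated annotations should be indistinguishable from
% authentic ones (standard GAN value function).
V(D, G) = \mathbb{E}_{a \sim p_{\text{data}}}\bigl[\log D(a)\bigr]
        + \mathbb{E}_{\tilde{a} \sim p_G}\bigl[\log\bigl(1 - D(\tilde{a})\bigr)\bigr]

% Principle 2: generated annotations should be informative about the label.
% The mutual information is intractable, so it is replaced by a
% variational lower bound computed with the auxiliary network Q.
L_I(G, Q) = \mathbb{E}_{y \sim p(y),\; \tilde{a} \sim p_G(\cdot \mid y)}
            \bigl[\log Q(y \mid \tilde{a})\bigr] + H(y)
          \;\le\; I(y;\, \tilde{a})

% Combined objective (lambda > 0 trades off the two principles).
\min_{G,\,Q} \; \max_{D} \;\; V(D, G) \;-\; \lambda \, L_I(G, Q)
```

Under this reading, the discriminator term enforces the first principle and the lower bound $L_I$ enforces the second. Because ground-truth labels are not observed in crowdsourced data, any concrete instantiation would need some estimate of $y$; how that is obtained is not specified in this abstract.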