As larger and more comprehensive datasets become standard in contemporary machine learning, it becomes increasingly difficult to obtain reliable label information with which to train sophisticated models. To address this problem, crowdsourcing has emerged as a popular, inexpensive, and efficient data mining solution for distributed label collection. However, crowdsourced annotations are inherently untrustworthy, as the labels are provided by anonymous volunteers whose expertise may be varying and unreliable. Worse yet, some participants on commonly used platforms such as Amazon Mechanical Turk may be adversarial and provide intentionally incorrect label information without the end user's knowledge. We discuss three conventional models of the label generation process, describing their parameterizations and the model-based approaches used to fit them. We then propose OpinionRank, a model-free, interpretable, graph-based spectral algorithm for integrating crowdsourced annotations into reliable labels for supervised or semi-supervised learning. Our experiments show that OpinionRank compares favorably with more highly parameterized algorithms. We also show that OpinionRank scales to very large datasets and large numbers of label sources, and requires considerably fewer computational resources than previous approaches.