With the increased interest in machine learning and big data problems, the need for large amounts of labelled data has also grown. However, it is often infeasible to have experts label all of this data, which leads many practitioners to turn to crowdsourcing. In this paper, we present new techniques that improve the quality of the labels while attempting to reduce the cost. The naive approach to assigning labels is a majority vote. In the context of data labelling, however, this is not always ideal, because labellers are not equally reliable. One might instead give certain labellers more influence through a weighted vote based on past performance. This paper investigates more sophisticated methods, such as Bayesian inference, to measure both the performance of the labellers and the confidence in each label. The methods we propose follow an iterative improvement algorithm that attempts to use the fewest workers necessary to reach the desired confidence in the inferred label. We test the proposed methods on simulated binary classification problems with simulated workers and questions. In these simulations, our methods outperform the standard voting methods in both cost and accuracy while maintaining higher reliability when there is disagreement within the crowd.
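The idea of querying workers only until a label is confident enough can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes each worker's accuracy is already known and lies strictly between 0 and 1, whereas the abstract describes inferring worker performance. The function names, the `target` threshold of 0.95, and the uniform prior are all hypothetical choices made for illustration.

```python
import random


def posterior_update(p_one, vote, accuracy):
    """Bayes update of P(label = 1) after one worker's binary vote.

    Assumes the worker answers correctly with probability `accuracy`,
    independently of the true label (an illustrative assumption).
    """
    like_one = accuracy if vote == 1 else 1.0 - accuracy
    like_zero = 1.0 - accuracy if vote == 1 else accuracy
    evidence = like_one * p_one + like_zero * (1.0 - p_one)
    return like_one * p_one / evidence


def label_adaptively(true_label, accuracies, target=0.95, prior=0.5, rng=random):
    """Query simulated workers one at a time, stopping at the target confidence.

    Returns (inferred_label, confidence, workers_used).
    """
    p_one = prior
    used = 0
    for acc in accuracies:
        # Simulated worker: votes for the true label with probability `acc`.
        vote = true_label if rng.random() < acc else 1 - true_label
        p_one = posterior_update(p_one, vote, acc)
        used += 1
        if max(p_one, 1.0 - p_one) >= target:
            break  # confident enough, so no further workers are paid
    label = 1 if p_one >= 0.5 else 0
    return label, max(p_one, 1.0 - p_one), used


def majority_vote(votes):
    """Baseline: label 1 only if strictly more than half of the votes are 1."""
    return 1 if 2 * sum(votes) > len(votes) else 0


if __name__ == "__main__":
    rng = random.Random(0)
    workers = [0.8] * 15  # hypothetical pool of equally reliable workers
    label, confidence, used = label_adaptively(1, workers, rng=rng)
    print(label, round(confidence, 3), used)
```

In this sketch the cost saving comes from the early exit: a question on which the first few workers agree stops after a handful of votes, while a contested question keeps drawing workers until the posterior settles or the pool runs out.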