In the domains of dataset construction and crowdsourcing, a notable challenge is to aggregate labels from a heterogeneous set of labelers, each of whom is potentially an expert in some subset of tasks (and less reliable in others). To reduce the cost of hiring human labelers or training automated labeling systems, it is of interest to minimize the number of labelers while ensuring the reliability of the resulting dataset. We model this as the problem of performing $K$-class classification using the predictions of smaller classifiers, each trained on a subset of $[K]$, and derive bounds on the number of classifiers needed to accurately infer the true class of an unlabeled sample under both adversarial and stochastic assumptions. By exploiting a connection to the classical set cover problem, we produce a near-optimal scheme for designing such configurations of classifiers, which recovers the well-known one-vs.-one classification approach as a special case. Experiments with the MNIST and CIFAR-10 datasets demonstrate the favorable accuracy (compared to a centralized classifier) of our aggregation scheme applied to classifiers trained on subsets of the data. These results suggest a new way to automatically label data or adapt an existing set of local classifiers to larger-scale multiclass problems.
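To make the one-vs.-one special case concrete, the following is a minimal, hypothetical sketch of how predictions from pairwise classifiers (each trained on a two-class subset of $[K]$) might be aggregated by majority vote; the function name and data layout are illustrative assumptions, not the paper's actual scheme.

```python
from collections import Counter

def one_vs_one_aggregate(pairwise_preds):
    """Infer a sample's class by majority vote over pairwise classifiers.

    pairwise_preds: dict mapping a class pair (i, j) to the class that
    the classifier trained on {i, j} predicted for this sample.
    (Illustrative sketch; the paper's general scheme allows arbitrary
    subsets of [K], of which pairs are the one-vs.-one special case.)
    """
    votes = Counter(pairwise_preds.values())
    return votes.most_common(1)[0][0]

# Hypothetical example with K = 3 classes: three pairwise classifiers,
# one per two-class subset, each votes for its preferred class.
preds = {(0, 1): 1, (0, 2): 2, (1, 2): 1}
print(one_vs_one_aggregate(preds))  # class 1 wins with two of three votes
```

A larger subset design would replace the pairs with the near-optimal set-cover-based configurations described above, but the vote-counting aggregation shown here is the same in spirit.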