Today, ground-truth generation relies on datasets annotated by cloud-based annotation services. These services rely on human annotators, which can be prohibitively expensive. In this paper, we consider the problem of hybrid human-machine labeling, which trains a classifier to accurately auto-label part of the dataset. However, training the classifier can also be expensive. We propose an iterative approach that minimizes overall cost by jointly determining, at each step, which samples to label using humans and which to label using the trained classifier. We validate our approach on well-known public datasets such as Fashion-MNIST, CIFAR-10, CIFAR-100, and ImageNet. In some cases, our approach has 6x lower overall cost than having humans label the entire dataset, and it is always cheaper than the cheapest competing strategy.
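
To make the cost trade-off above concrete, the following is a minimal single-step sketch, not the iterative method proposed in the paper. All names, prices, and the `toy_auto_fraction` curve are hypothetical assumptions chosen for illustration: humans label a training subset, a classifier is trained on it once, the classifier auto-labels some fraction of the remaining samples, and humans label whatever is left.

```python
import math


def toy_auto_fraction(n_train):
    """Hypothetical saturating curve (an assumption, not a measured result):
    the fraction of the remaining samples that a classifier trained on
    n_train samples could auto-label at the required accuracy."""
    return 1.0 - math.exp(-n_train / 10_000)


def total_cost(n_total, n_train, human_cost, train_cost, auto_fraction):
    """Cost of hand-labeling n_train samples, training once on them,
    auto-labeling part of the rest, and hand-labeling what remains.

    human_cost: assumed price per human-labeled sample.
    train_cost: assumed training price per training sample.
    """
    remaining = n_total - n_train
    auto_labeled = int(auto_fraction(n_train) * remaining)
    human_labeled = n_total - auto_labeled  # training set plus leftovers
    return human_labeled * human_cost + n_train * train_cost


def cheapest_split(n_total, candidates, human_cost, train_cost, auto_fraction):
    """Return the candidate training-set size with the lowest total cost."""
    return min(
        candidates,
        key=lambda n: total_cost(n_total, n, human_cost, train_cost, auto_fraction),
    )


if __name__ == "__main__":
    n_total = 60_000                       # illustrative dataset size
    candidates = range(0, n_total + 1, 5_000)
    best = cheapest_split(n_total, candidates, 0.04, 0.01, toy_auto_fraction)
    cost = total_cost(n_total, best, 0.04, 0.01, toy_auto_fraction)
    print(f"train on {best} human labels, total cost {cost:.2f}")
```

Because a training-set size of 0 reproduces the all-human baseline, including it among the candidates means the selected split is never more expensive than labeling everything by hand under this toy model. The approach described in the abstract goes further by revisiting this decision at each step as the classifier improves.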