Creating large-scale, high-quality labeled datasets is a major bottleneck in supervised machine learning workflows. Auto-labeling systems are a promising way to reduce reliance on manual labeling for dataset construction. Threshold-based auto-labeling, in which validation data obtained from humans is used to find a confidence threshold above which data is machine-labeled, is emerging as a popular solution used widely in practice. Given the long shelf-life and diverse usage of the resulting datasets, understanding when the data obtained by such auto-labeling systems can be relied on is crucial. In this work, we analyze threshold-based auto-labeling systems and derive sample complexity bounds on the amount of human-labeled validation data required to guarantee the quality of machine-labeled data. Our results provide two insights. First, reasonable chunks of the unlabeled data can be automatically and accurately labeled by seemingly bad models. Second, a hidden downside of threshold-based auto-labeling systems is their potentially prohibitive validation data usage. Together, these insights describe the promise and pitfalls of using such systems. We validate our theoretical guarantees with simulations and study the efficacy of threshold-based auto-labeling on real datasets.
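To make the workflow concrete, below is a minimal sketch of the threshold-based auto-labeling loop the abstract describes. The function names (`find_threshold`, `auto_label`, `model_predict`) and the `target_acc` parameter are illustrative assumptions, not the paper's implementation; the paper's procedure selects the threshold with statistical guarantees derived from its sample complexity analysis, whereas this sketch uses a plain empirical rule on the validation set.

```python
# A minimal sketch of threshold-based auto-labeling (TBAL), assuming a model
# that returns predicted labels and confidence scores; all names here are
# illustrative, not the paper's API.
import numpy as np

def find_threshold(val_conf, val_correct, target_acc=0.95):
    """Pick the smallest confidence threshold such that validation accuracy,
    restricted to points at or above the threshold, meets target_acc."""
    for t in np.sort(val_conf):           # candidate thresholds, ascending
        mask = val_conf >= t
        if val_correct[mask].mean() >= target_acc:
            return t                       # smallest qualifying threshold maximizes coverage
    return None                            # no threshold meets the target

def auto_label(model_predict, X_unlabeled, X_val, y_val, target_acc=0.95):
    """Machine-label the unlabeled points whose confidence clears the
    threshold estimated on the human-labeled validation set."""
    val_pred, val_conf = model_predict(X_val)
    thr = find_threshold(val_conf, val_pred == y_val, target_acc)
    if thr is None:
        return np.array([], dtype=int), np.array([])  # nothing is auto-labeled
    pred, conf = model_predict(X_unlabeled)
    keep = conf >= thr
    return np.where(keep)[0], pred[keep]
```

Note how the validation set plays a dual role here: it both selects the threshold and certifies the accuracy of the auto-labeled region, which is why the paper's bounds on validation data usage are central to trusting the output.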