Good Data from Bad Models: Foundations of Threshold-based Auto-labeling

Creating large-scale high-quality labeled datasets is a major bottleneck in supervised machine learning workflows. Auto-labeling systems are a promising way to reduce reliance on manual labeling for dataset construction. Threshold-based auto-labeling, where validation data obtained from humans is used to find a threshold for confidence above which the data is machine-labeled, is emerging as a popular solution used widely in practice. Given the long shelf-life and diverse usage of the resulting datasets, understanding when the data obtained by such auto-labeling systems can be relied on is crucial. In this work, we analyze threshold-based auto-labeling systems and derive sample complexity bounds on the amount of human-labeled validation data required for guaranteeing the quality of machine-labeled data. Our results provide two insights. First, reasonable chunks of the unlabeled data can be automatically and accurately labeled by seemingly bad models. Second, a hidden downside of threshold-based auto-labeling systems is potentially prohibitive validation data usage. Together, these insights describe the promise and pitfalls of using such systems. We validate our theoretical guarantees with simulations and study the efficacy of threshold-based auto-labeling on real datasets.
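The thresholding step described above can be illustrated with a minimal sketch. It is a simplification under stated assumptions, not the procedure analyzed in the paper: it picks the threshold from the empirical validation error alone, with no confidence-interval correction, and the names `find_threshold`, `auto_label`, and `max_error` are hypothetical.

```python
def find_threshold(val_conf, val_correct, max_error):
    """Return the lowest confidence threshold t such that the model's
    empirical error on validation points with confidence >= t is at most
    max_error, or None if no threshold qualifies.

    val_conf:    model confidence for each human-labeled validation point
    val_correct: True where the model's prediction matched the human label
    """
    # Walk the validation points from most to least confident.
    pairs = sorted(zip(val_conf, val_correct), key=lambda p: p[0], reverse=True)
    best = None
    n_selected = 0
    n_wrong = 0
    i = 0
    while i < len(pairs):
        t = pairs[i][0]
        # Absorb every point tied at confidence t before evaluating t.
        while i < len(pairs) and pairs[i][0] == t:
            n_selected += 1
            n_wrong += 0 if pairs[i][1] else 1
            i += 1
        # Empirical error among points with confidence >= t.
        if n_wrong / n_selected <= max_error:
            best = t
    return best


def auto_label(conf, preds, threshold):
    """Return (index, predicted_label) for every unlabeled point whose
    confidence reaches the threshold; the rest stay for human labeling."""
    if threshold is None:
        return []
    return [(i, y) for i, (c, y) in enumerate(zip(conf, preds)) if c >= threshold]


# Toy validation set: the model is wrong on two low-confidence points.
val_conf = [0.95, 0.9, 0.8, 0.7, 0.6, 0.5]
val_correct = [True, True, True, False, True, False]
t = find_threshold(val_conf, val_correct, max_error=0.1)

# Machine-label only the unlabeled points at or above the threshold.
labeled = auto_label([0.99, 0.85, 0.75, 0.4], [1, 0, 1, 0], t)
```

With few validation points above a candidate threshold, the empirical error is a noisy estimate, which is where the validation-data cost discussed in the abstract comes from.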

