Obtaining high-quality labeled datasets is often costly, requiring either human annotation or expensive experiments. In theory, powerful pre-trained AI models provide an opportunity to automatically label datasets and save costs. Unfortunately, these models come with no guarantees on their accuracy, making wholesale replacement of manual labeling impractical. In this work, we propose a method for leveraging pre-trained AI models to curate cost-effective and high-quality datasets. In particular, our approach results in probably approximately correct labels: with high probability, the overall labeling error is small. Our method is nonasymptotically valid under minimal assumptions on the dataset or the AI model being studied, and thus enables rigorous yet efficient dataset curation using modern AI models. We demonstrate the benefits of the methodology through text annotation with large language models, image labeling with pre-trained vision models, and protein folding analysis with AlphaFold.
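One way to make the "probably approximately correct" guarantee concrete is sketched below. The notation is an assumption introduced here for illustration and does not come from the abstract: $Y_i$ is the true label of the $i$-th data point, $\tilde{Y}_i$ is the label assigned by the curation procedure, $\ell$ is a loss function such as the zero-one loss, $\epsilon$ is a user-chosen error tolerance, and $\alpha$ is a user-chosen failure probability.

```latex
% Hedged formalization of "with high probability, the overall labeling error is small".
% All symbols (n, Y_i, \tilde{Y}_i, \ell, \epsilon, \alpha) are illustrative assumptions.
\[
  \mathbb{P}\left( \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(Y_i, \tilde{Y}_i\bigr) \le \epsilon \right) \ge 1 - \alpha
\]
```

Under this reading, "approximately correct" corresponds to the average labeling error being at most $\epsilon$, and "probably" corresponds to that event holding with probability at least $1 - \alpha$ for a finite dataset of size $n$, in line with the nonasymptotic validity claimed above.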