Obtaining high-quality labeled datasets is often costly, requiring either human annotation or expensive experiments. In theory, powerful pre-trained AI models provide an opportunity to automatically label datasets and save costs. Unfortunately, these models come with no guarantees on their accuracy, making wholesale replacement of manual labeling impractical. In this work, we propose a method for leveraging pre-trained AI models to curate cost-effective and high-quality datasets. In particular, our approach results in probably approximately correct labels: with high probability, the overall labeling error is small. Our method is nonasymptotically valid under minimal assumptions on the dataset or the AI model being studied, and thus enables rigorous yet efficient dataset curation using modern AI models. We demonstrate the benefits of the methodology through text annotation with large language models, image labeling with pre-trained vision models, and protein folding analysis with AlphaFold.
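One way to read the "probably approximately correct" guarantee is as a single probability statement. The notation below is an assumption introduced for illustration and does not come from the abstract: $Y_i$ is the true label of the $i$-th of $n$ data points, $\tilde{Y}_i$ is the label the procedure outputs (whether produced by the AI model or collected from an expert), $\ell$ is a labeling loss such as the 0-1 loss, $\epsilon$ is the error tolerance, and $\alpha$ is the allowed failure probability.

```latex
% A sketch of the PAC labeling guarantee under the assumed notation above:
% the average labeling error over the dataset is at most \epsilon,
% except with probability at most \alpha.
\[
  \mathbb{P}\!\left( \frac{1}{n} \sum_{i=1}^{n} \ell\bigl(Y_i, \tilde{Y}_i\bigr) \le \epsilon \right) \ge 1 - \alpha .
\]
```

Under this reading, "approximately correct" corresponds to the tolerance $\epsilon$ on the overall labeling error, and "probably" corresponds to the confidence level $1 - \alpha$. Nonasymptotic validity means the bound is meant to hold for any finite $n$, not only in the limit.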