For machine learning models trained with limited labeled data, validation stands to become the main bottleneck to reducing overall annotation costs. We propose a statistical validation algorithm that accurately estimates the F-score of binary classifiers for rare categories, where finding relevant examples to evaluate on is particularly challenging. Our key insight is that performing calibration and importance sampling simultaneously enables accurate estimates even in the low-sample regime (< 300 samples). Critically, we also derive a single-trial estimator of our method's variance and demonstrate that it is empirically accurate at low sample counts, so a practitioner can judge how far to trust a given low-sample estimate. When validating state-of-the-art semi-supervised models on ImageNet and iNaturalist2017, our method achieves the same estimates of model performance with up to 10x fewer labels than competing approaches. In particular, we can estimate model F1 scores with a variance of 0.005 using as few as 100 labels.
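To make the importance-sampling idea in the abstract concrete, the following is a minimal sketch and not the authors' algorithm. It omits the calibration step entirely and assumes a simple proposal distribution proportional to the classifier score plus a small floor. The function name `estimate_f1`, the `label_fn` callback (standing in for a human annotator), and the `floor` and `threshold` parameters are all hypothetical. It uses the identity F1 = 2·TP / (predicted positives + actual positives), where the number of predicted positives is known exactly from the pool and the other two counts are estimated from the labeled sample.

```python
import random


def estimate_f1(scores, label_fn, n_labels, threshold=0.5, floor=0.05, seed=0):
    """Importance-sampled F1 estimate for a binary classifier on an unlabeled pool.

    scores   : classifier scores in [0, 1], one per pool item
    label_fn : hypothetical oracle, label_fn(i) -> 0 or 1 (e.g. a human annotator)
    n_labels : labeling budget (items are drawn with replacement)
    """
    rng = random.Random(seed)
    n_pool = len(scores)
    preds = [1 if s >= threshold else 0 for s in scores]

    # Assumed proposal: probability proportional to score + floor, so likely
    # positives are oversampled while every item keeps nonzero probability.
    raw = [s + floor for s in scores]
    total = sum(raw)
    q = [r / total for r in raw]

    sampled = rng.choices(range(n_pool), weights=q, k=n_labels)

    tp_hat = 0.0   # estimate of the number of true positives in the pool
    pos_hat = 0.0  # estimate of the number of actual positives in the pool
    for i in sampled:
        y = label_fn(i)
        w = 1.0 / (n_labels * q[i])  # importance weight undoes the sampling bias
        pos_hat += w * y
        tp_hat += w * y * preds[i]

    pred_pos = sum(preds)  # known exactly, no labels needed
    denom = pred_pos + pos_hat
    if denom == 0:
        return 0.0
    # The ratio of two noisy estimates can exceed 1, so clip it.
    return min(1.0, 2.0 * tp_hat / denom)


if __name__ == "__main__":
    # Synthetic pool, invented for illustration: 50 high-score items, 950 low-score.
    demo_scores = [0.9] * 50 + [0.1] * 950
    demo_truth = [1] * 40 + [0] * 10 + [1] * 10 + [0] * 940  # true F1 is 0.8
    print(estimate_f1(demo_scores, lambda i: demo_truth[i], n_labels=300))
```

Because rare positives cluster among high-scoring items, a score-weighted proposal spends most of the labeling budget where the true positives are likely to be, which is the intuition behind needing fewer labels than uniform sampling. A single run like this gives no indication of its own variance; that is the gap the single-trial variance estimator described in the abstract is meant to fill.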