Recent advances in machine learning have led to increased deployment of black-box classifiers across a wide variety of applications. In many such situations there is a critical need to both reliably assess the performance of these pre-trained models and to perform this assessment in a label-efficient manner (given that labels may be scarce and costly to collect). In this paper, we introduce an active Bayesian approach for assessment of classifier performance to satisfy the desiderata of both reliability and label-efficiency. We begin by developing inference strategies to quantify uncertainty for common assessment metrics such as accuracy, misclassification cost, and calibration error. We then propose a general framework for active Bayesian assessment using inferred uncertainty to guide efficient selection of instances for labeling, enabling better performance assessment with fewer labels. We demonstrate significant gains from our proposed active Bayesian approach via a series of systematic empirical experiments assessing the performance of modern neural classifiers (e.g., ResNet and BERT) on several standard image and text classification datasets.
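To make the flavor of this approach concrete, here is a minimal sketch (not the paper's actual implementation) of active Bayesian assessment of accuracy. It assumes the unlabeled pool is partitioned into K groups (e.g., by predicted class), places a conjugate Beta-Bernoulli model on each group's unknown accuracy, and spends a labeling budget by repeatedly querying the group whose posterior accuracy is most uncertain. The values of K, the simulated true_acc array, and the equal-weight averaging at the end are all illustrative assumptions, not details from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

K = 10                                       # hypothetical number of groups (e.g., predicted classes)
true_acc = rng.uniform(0.6, 0.99, size=K)    # unknown per-group accuracies (simulated oracle)

# Beta(1, 1) uniform priors over each group's accuracy.
alpha = np.ones(K)
beta = np.ones(K)

def posterior_variance(a, b):
    """Variance of a Beta(a, b) distribution, vectorized over groups."""
    return (a * b) / ((a + b) ** 2 * (a + b + 1.0))

n_labels = 500                               # labeling budget
for _ in range(n_labels):
    # Active selection: label an instance from the group whose
    # accuracy estimate currently has the highest posterior variance.
    g = int(np.argmax(posterior_variance(alpha, beta)))
    # Query the oracle: was the classifier correct on a random
    # instance drawn from group g? (Simulated here via true_acc.)
    correct = rng.random() < true_acc[g]
    # Conjugate Bernoulli update of the group's Beta posterior.
    alpha[g] += correct
    beta[g] += 1 - correct

post_mean = alpha / (alpha + beta)
overall_acc = post_mean.mean()               # equal group weights, a simplification
print("posterior mean accuracy per group:", np.round(post_mean, 3))
print("estimated overall accuracy:", round(overall_acc, 3))
```

The full framework described in the abstract goes beyond this sketch: it quantifies uncertainty for other metrics as well (misclassification cost, calibration error) and admits other uncertainty-guided selection strategies than the greedy variance rule used here.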