We introduce a new framework for sample-efficient model evaluation that we call active testing. While approaches like active learning reduce the number of labels needed for model training, existing literature largely ignores the cost of labeling test data, typically making the unrealistic assumption that large labelled test sets are available for model evaluation. This creates a disconnect from real applications, where test labels matter and are just as expensive to obtain, e.g. when optimizing hyperparameters. Active testing addresses this by carefully selecting the test points to label, ensuring model evaluation is sample-efficient. To this end, we derive theoretically grounded and intuitive acquisition strategies that are specifically tailored to the goals of active testing, noting that these are distinct from those of active learning. Actively selecting labels introduces a bias; we further show how to remove this bias while reducing the variance of the estimator at the same time. Active testing is easy to implement and can be applied to any supervised machine learning method. We demonstrate its effectiveness on models including WideResNets and Gaussian processes, on datasets including Fashion-MNIST and CIFAR-100.
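To make the bias-and-correction point concrete, the sketch below illustrates the general idea with plain importance sampling rather than the paper's own estimator: test points are acquired with probability proportional to a surrogate of their loss, the naive average of the acquired losses overestimates the risk, and an inverse-probability reweighting recovers an unbiased estimate. All names, the synthetic losses, and the with-replacement acquisition scheme are illustrative assumptions, not the method from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: a pool of N unlabelled test points whose true per-point
# losses (under the model being evaluated) would only be revealed by labelling.
N, M = 1000, 50                        # pool size, labelling budget
true_losses = rng.gamma(2.0, 1.0, N)   # stand-in for the unknown per-point losses
true_risk = true_losses.mean()         # the quantity we want to estimate

# Acquisition: label points with probability proportional to a noisy surrogate
# of their expected loss ("carefully selecting the test points to label").
surrogate = true_losses * rng.lognormal(0.0, 0.5, N)
q = surrogate / surrogate.sum()
idx = rng.choice(N, size=M, replace=True, p=q)  # with-replacement, for a simple unbiased estimator

# Naive estimate: averaging the acquired losses is biased, because high-loss
# points were deliberately over-sampled.
naive_estimate = true_losses[idx].mean()

# Bias-corrected estimate: reweight each acquired loss by 1 / (N * q_i), the
# standard importance-sampling correction, so the expectation equals the true risk.
corrected_estimate = (true_losses[idx] / (N * q[idx])).mean()

print(f"true risk:      {true_risk:.3f}")
print(f"naive estimate: {naive_estimate:.3f}  (biased upwards)")
print(f"corrected:      {corrected_estimate:.3f}  (unbiased)")
```

Averaging the corrected estimate over many repeated runs recovers the true risk, whereas the naive average does not; the paper's contribution is an estimator and acquisition strategies that achieve this unbiasedness while also keeping the variance low.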