Are Labels Always Necessary for Classifier Accuracy Evaluation?

To calculate model accuracy on a computer vision task, e.g., object recognition, we usually require a test set comprising test samples and their ground truth labels. While standard use cases satisfy this requirement, many real-world scenarios involve unlabeled test data, rendering common model evaluation methods infeasible. We investigate this important and under-explored problem, Automatic model Evaluation (AutoEval). Specifically, given a labeled training set and a classifier, we aim to estimate the classification accuracy on unlabeled test datasets. We construct a meta-dataset: a dataset composed of datasets generated from the original images via various transformations such as rotation, background substitution, and foreground scaling. As the classification accuracy of the model on each sample (dataset) is known from the original dataset labels, our task can be solved via regression. Using feature statistics to represent the distribution of a sample dataset, we can train regression models (e.g., a regression neural network) to predict model performance. Using a synthetic meta-dataset for training and real-world datasets for testing, we report reasonable and promising predictions of model accuracy. We also provide insights into the application scope, limitations, and potential future directions of AutoEval.
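The pipeline described above can be illustrated with a minimal sketch under strong simplifying assumptions; it is not the implementation used in this work. A toy 2-D Gaussian feature space stands in for real image features, a translation along one axis stands in for transformations such as rotation or background substitution, and a quadratic least-squares fit stands in for the regression model. The names `make_labeled_set`, `classify`, `transform`, `set_statistics`, and `estimate_accuracy` are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def make_labeled_set(n_per_class=500):
    """Toy 'original dataset': two Gaussian classes in a 2-D feature space (assumption)."""
    x0 = rng.normal(loc=[-1.0, 0.0], scale=1.0, size=(n_per_class, 2))
    x1 = rng.normal(loc=[1.0, 0.0], scale=1.0, size=(n_per_class, 2))
    return np.vstack([x0, x1]), np.repeat([0, 1], n_per_class)

def classify(x):
    """Fixed toy classifier: predict class 1 when the first feature is positive."""
    return (x[:, 0] > 0).astype(int)

def transform(x, shift):
    """Hypothetical label-preserving transformation: translate features along axis 0."""
    return x + np.array([shift, 0.0])

def set_statistics(x, ref_mean):
    """Describe a whole dataset by statistics of its features (no labels used)."""
    d = np.linalg.norm(x.mean(axis=0) - ref_mean)
    return np.array([1.0, d, d ** 2])  # bias, mean shift, squared mean shift

# Labeled original data and the reference statistic it provides.
x_orig, y_orig = make_labeled_set()
ref_mean = x_orig.mean(axis=0)

# Meta-dataset: each row describes one transformed sample set with known accuracy.
shifts = np.linspace(0.0, 2.5, 30)
meta_x = np.array([set_statistics(transform(x_orig, s), ref_mean) for s in shifts])
meta_y = np.array([(classify(transform(x_orig, s)) == y_orig).mean() for s in shifts])

# Regression from dataset statistics to accuracy (ordinary least squares).
weights, *_ = np.linalg.lstsq(meta_x, meta_y, rcond=None)

def estimate_accuracy(x_unlabeled):
    """Predict classifier accuracy on a dataset without using its labels."""
    return float(set_statistics(x_unlabeled, ref_mean) @ weights)

# Held-out check: labels are used only to verify the estimate.
x_new, y_new = make_labeled_set()
x_new = transform(x_new, 1.3)
estimated = estimate_accuracy(x_new)
actual = float((classify(x_new) == y_new).mean())
print(f"estimated={estimated:.3f} actual={actual:.3f}")
```

In this toy setting the mean shift alone is informative because the transformation is a pure translation; with real image features, richer statistics of the feature distribution would likely be needed.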
