Estimating how well a machine learning model performs during inference is critical in a variety of scenarios (for example, to quantify uncertainty, or to choose from a library of available models). However, the standard accuracy estimate of softmax confidence is not versatile and cannot reliably predict different performance metrics (e.g., F1-score, recall) or the performance in different application scenarios or input domains. In this work, we systematically generalize performance estimation to a diverse set of metrics and scenarios and discuss generalized notions of uncertainty calibration. We propose the use of post-hoc models to accomplish this goal and investigate design parameters, including the model type, feature engineering, and performance metric, to achieve the best estimation quality. Emphasis is given to object detection problems and, unlike prior work, our approach enables the estimation of per-image metrics such as recall and F1-score. Through extensive experiments with computer vision models and datasets in three use cases -- mobile edge offloading, model selection, and dataset shift -- we find that the proposed post-hoc models consistently outperform the standard calibrated confidence baselines. To the best of our knowledge, this is the first work to develop a unified framework that addresses different performance estimation problems for machine learning inference.
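To make the idea of a post-hoc performance estimator concrete, the following is a minimal sketch (not the paper's implementation): it trains a regressor on hand-crafted per-image features derived from a detector's confidence scores to predict a per-image metric such as F1-score. The feature set, the choice of GradientBoostingRegressor, and the synthetic data are illustrative assumptions only.

```python
# Sketch of a post-hoc performance estimator: predict a per-image metric
# (here F1-score) from simple features of the detector's raw outputs.
# Feature choices, model type, and data below are assumptions for illustration.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_absolute_error

def image_features(det_scores):
    """Summarize one image's detection confidences into a fixed-size vector."""
    s = np.asarray(det_scores, dtype=float)
    if s.size == 0:
        return np.zeros(4)
    return np.array([s.mean(), s.max(), s.min(), float(s.size)])

# Hypothetical calibration data: per-image detection scores paired with the
# true per-image F1 computed against ground-truth boxes (not shown here).
rng = np.random.default_rng(0)
scores_per_image = [rng.uniform(0.2, 1.0, size=rng.integers(1, 12)) for _ in range(500)]
true_f1_per_image = np.clip([s.mean() + rng.normal(0, 0.05) for s in scores_per_image], 0, 1)

X = np.stack([image_features(s) for s in scores_per_image])
y = np.asarray(true_f1_per_image)

# Fit the post-hoc estimator on a calibration split, evaluate on held-out images.
est = GradientBoostingRegressor().fit(X[:400], y[:400])
pred_f1 = est.predict(X[400:])
print("MAE of predicted per-image F1:", mean_absolute_error(y[400:], pred_f1))
```

In practice, the features, the metric being predicted, and the post-hoc model type are exactly the design parameters the paper investigates; this sketch only illustrates the overall workflow of fitting such an estimator on held-out data.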