A variety of performance metrics are commonly used in the machine learning literature to evaluate classification systems. Among the most common for measuring the quality of hard decisions are standard and balanced accuracy, standard and balanced error rate, the F-beta score, and the Matthews correlation coefficient (MCC). In this document, we review the definitions of these and other metrics and compare them with the expected cost (EC), a metric introduced in every statistical learning course but rarely used in the machine learning literature. We show that both the standard and the balanced error rate are special cases of the EC. Further, we show how the EC relates to the F-score and the MCC, and argue that it is superior to these traditional metrics: it is more elegant, general, and intuitive, and it is grounded in basic principles from statistics. The metrics above measure the quality of hard decisions. Yet most modern classification systems output continuous scores for the classes, which we may want to evaluate directly. Metrics for measuring the quality of system scores include the area under the ROC curve, the equal error rate, cross-entropy, the Brier score, and the Bayes EC or Bayes risk, among others. The last three are special cases of a family of metrics given by the expected value of proper scoring rules (PSRs). We review the theory behind these metrics and argue that they are the most principled way to measure the quality of the posterior probabilities produced by a system. Finally, we show how to use these metrics to compute the system's calibration loss and compare this metric with the standard expected calibration error (ECE), arguing that calibration loss based on PSRs is superior to the ECE for a variety of reasons.
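The claim that the standard and balanced error rates are special cases of the EC can be checked numerically. The sketch below is illustrative rather than taken from the paper: it assumes the EC is the prior-weighted sum of costs times per-class decision rates, with integer class labels and every class present in the data. The function name `expected_cost` and the toy labels are hypothetical.

```python
import numpy as np

def expected_cost(y_true, y_pred, cost, priors=None):
    """EC = sum_ij priors[i] * cost[i, j] * R[i, j], where R[i, j] is the
    fraction of class-i samples that were decided as class j.
    Assumes integer labels 0..K-1 and at least one sample per class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    k = cost.shape[0]
    counts = np.zeros((k, k))
    np.add.at(counts, (y_true, y_pred), 1)       # confusion counts
    class_counts = counts.sum(axis=1)
    rates = counts / class_counts[:, None]       # R[i, j]
    if priors is None:                           # default: empirical priors
        priors = class_counts / class_counts.sum()
    return float(np.sum(priors[:, None] * cost * rates))

# Toy data: 8 samples of class 0 (1 wrong), 2 samples of class 1 (1 wrong).
y_true = np.array([0, 0, 0, 0, 0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 0, 0, 0, 1, 0, 1])

k = 2
zero_one = 1.0 - np.eye(k)                       # cost 1 for every error
priors = np.bincount(y_true, minlength=k) / len(y_true)

# 0-1 costs with empirical priors give the standard error rate.
err = expected_cost(y_true, y_pred, zero_one)
direct_err = float(np.mean(y_true != y_pred))

# Costs 1 / (K * P_i) for errors on class i give the balanced error rate.
balanced_cost = zero_one / (k * priors[:, None])
ber = expected_cost(y_true, y_pred, balanced_cost)
direct_ber = float(np.mean([np.mean(y_pred[y_true == c] != c) for c in range(k)]))

print(err, direct_err)
print(ber, direct_ber)
```

In both cases the EC matches the directly computed metric. Changing the cost matrix or the priors yields other operating points without changing the definition.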