There are strong incentives to build models that demonstrate outstanding predictive performance on various datasets and benchmarks. We believe these incentives risk a narrow focus on models and on the performance metrics used to evaluate and compare them, resulting in a growing body of literature to evaluate and compare metrics. This paper strives for a more balanced perspective on classifier performance metrics by highlighting their distributions under different models of uncertainty and showing how this uncertainty can easily eclipse differences in the empirical performance of classifiers. We begin by emphasising the fundamentally discrete nature of empirical confusion matrices and show how binary matrices can be meaningfully represented in a three-dimensional compositional lattice, whose cross-sections form the basis of the space of receiver operating characteristic (ROC) curves. We develop equations, animations and interactive visualisations of the contours of performance metrics within (and beyond) this ROC space, showing how some are affected by class imbalance. We provide interactive visualisations that show the discrete posterior predictive probability mass functions of true and false positive rates in ROC space, and how these relate to uncertainty in performance metrics such as Balanced Accuracy (BA) and the Matthews Correlation Coefficient (MCC). Our hope is that these insights and visualisations will raise greater awareness of the substantial uncertainty in performance metric estimates that can arise when classifiers are evaluated on empirical datasets and benchmarks, and that claims about classification model performance will be tempered by this understanding.
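To make the scale of this uncertainty concrete, the following is a minimal Python sketch rather than the paper's own method. It computes BA and MCC from the four counts of a binary confusion matrix and then propagates uncertainty in the true positive and true negative rates into BA. The independent uniform Beta(1, 1) priors on the two rates, the function names and the example counts are all assumptions made for illustration. The sketch also samples the continuous posterior of the rates, which is a simpler stand-in for the discrete posterior predictive mass functions described above.

```python
import math
import random


def balanced_accuracy(tp, fn, fp, tn):
    """Mean of the true positive rate and the true negative rate."""
    tpr = tp / (tp + fn)
    tnr = tn / (tn + fp)
    return (tpr + tnr) / 2


def mcc(tp, fn, fp, tn):
    """Matthews Correlation Coefficient; returns 0.0 when a margin is empty."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


def ba_credible_interval(tp, fn, fp, tn, draws=20000, level=0.95, seed=0):
    """Equal-tailed interval for BA by Monte Carlo sampling.

    Assumption (not from the source): independent Beta(1, 1) priors on the
    true positive rate and the true negative rate, so each posterior is a
    Beta distribution with the observed counts plus one.
    """
    rng = random.Random(seed)
    samples = sorted(
        (rng.betavariate(tp + 1, fn + 1) + rng.betavariate(tn + 1, fp + 1)) / 2
        for _ in range(draws)
    )
    lower = samples[int((1 - level) / 2 * draws)]
    upper = samples[int((1 + level) / 2 * draws) - 1]
    return lower, upper


if __name__ == "__main__":
    # Hypothetical test set: 50 positives and 50 negatives.
    counts = dict(tp=40, fn=10, fp=5, tn=45)
    print("BA :", round(balanced_accuracy(**counts), 3))
    print("MCC:", round(mcc(**counts), 3))
    print("95% interval for BA:", ba_credible_interval(**counts))
```

With only 100 examples, the interval for BA under these assumptions spans several percentage points, which is often wider than the gap between competing classifiers on a benchmark.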