Analyzing classification model performance is a crucial task for machine learning practitioners. While practitioners often use count-based metrics derived from confusion matrices, such as accuracy, many applications, such as weather prediction, sports betting, or patient risk prediction, rely on a classifier's predicted probabilities rather than its predicted labels. In these settings, practitioners are concerned with producing a calibrated model, that is, one whose output probabilities reflect those of the true distribution. Model calibration is often analyzed visually through static reliability diagrams; however, the traditional calibration visualization can suffer from a variety of drawbacks due to the strong aggregation it requires, and count-based approaches are unable to sufficiently analyze model calibration. We present Calibrate, an interactive reliability diagram that addresses these issues. Calibrate constructs a reliability diagram that is resistant to the drawbacks of traditional approaches and allows for interactive subgroup analysis and instance-level inspection. We demonstrate the utility of Calibrate through use cases on both real-world and synthetic data, and we further validate Calibrate by presenting the results of a think-aloud experiment with data scientists who routinely analyze model calibration.
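For readers unfamiliar with the "strong aggregation" the abstract refers to, the following is a minimal sketch of how a conventional binned reliability diagram is computed: predictions are grouped into probability bins, and each bin's mean confidence is plotted against the observed frequency of the positive class. The function name, binning scheme, and synthetic data below are illustrative assumptions, not the construction used by Calibrate, which is designed precisely to avoid the drawbacks of this kind of binning.

```python
import numpy as np

def reliability_curve(y_true, y_prob, n_bins=10):
    """Traditional binned reliability diagram (illustrative sketch, not Calibrate's method).

    Aggregates predictions into equal-width probability bins and compares each
    bin's mean predicted probability (x-axis) to the observed fraction of
    positives (y-axis).
    """
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each prediction to a bin; clip so a prediction of exactly 1.0 stays in the last bin.
    bin_ids = np.clip(np.digitize(y_prob, bins) - 1, 0, n_bins - 1)
    mean_pred, frac_pos = [], []
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            mean_pred.append(y_prob[mask].mean())  # average confidence in the bin
            frac_pos.append(y_true[mask].mean())   # empirical frequency of the positive class
    return np.array(mean_pred), np.array(frac_pos)

# Synthetic example: outcomes drawn at the predicted rate, so the model is
# perfectly calibrated and the points should lie near the diagonal y = x.
rng = np.random.default_rng(0)
probs = rng.uniform(size=5000)
labels = (rng.uniform(size=5000) < probs).astype(int)
x, y = reliability_curve(labels, probs)
print(np.round(x, 2))
print(np.round(y, 2))
```

Because every point in such a diagram summarizes an entire bin of instances, subgroup-level and instance-level behavior is hidden, which motivates the interactive inspection that Calibrate provides.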