In medical image classification tasks, normal samples often far outnumber abnormal samples. In such class-imbalanced settings, reliably training deep neural networks remains a major challenge, because the predicted class probabilities may be biased toward the majority class. Calibration has been suggested as a way to alleviate some of these effects. However, there is insufficient analysis of whether, and under what conditions, calibrating a model improves its performance. In this study, we perform a systematic analysis of the effect of model calibration on classification performance for two medical imaging modalities, namely chest X-rays and fundus images, using various deep learning classifier backbones. We vary three factors: (i) the degree of class imbalance in the training dataset; (ii) the calibration method; and (iii) the classification threshold, comparing the default decision threshold of 0.5 with the optimal threshold derived from precision-recall (PR) curves. Our results indicate that at the default operating threshold of 0.5, the performance achieved with calibrated probabilities is significantly superior (p < 0.05) to that achieved with uncalibrated probabilities. At the PR-guided threshold, however, the difference between calibrated and uncalibrated performance is not statistically significant (p > 0.05). This finding holds for both image modalities and at varying degrees of imbalance.
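The comparison between the two operating thresholds can be illustrated with a small, self-contained sketch. Everything in it is an assumption made for illustration and is not taken from the study: the data are synthetic (10% positives, with logits shifted so that raw scores lean toward the majority class), the calibrator is a Platt-style map `sigmoid(a * logit + b)` fitted by a coarse grid search with `a > 0`, the metric is F1, and the calibrator is fitted and evaluated on the same samples for brevity, where a held-out set would normally be used. The helper names (`fit_platt_grid`, `f1_at`, `best_f1`) are hypothetical.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def nll(logits, labels, a, b):
    """Mean negative log-likelihood of the labels under sigmoid(a * logit + b)."""
    eps = 1e-12
    total = 0.0
    for z, y in zip(logits, labels):
        p = min(max(sigmoid(a * z + b), eps), 1.0 - eps)
        total -= math.log(p) if y == 1 else math.log(1.0 - p)
    return total / len(labels)


def fit_platt_grid(logits, labels):
    """Platt-style scaling via coarse grid search (illustrative, not the paper's method).

    The slope grid is strictly positive, so the fitted map is monotone increasing.
    The grid contains the identity map (a=1.0, b=0.0).
    """
    slopes = [0.25 * i for i in range(1, 17)]      # 0.25 .. 4.0
    offsets = [0.25 * j for j in range(-20, 21)]   # -5.0 .. 5.0
    return min(((a, b) for a in slopes for b in offsets),
               key=lambda ab: nll(logits, labels, ab[0], ab[1]))


def f1_at(probs, labels, threshold):
    """F1 score when predicting positive for probabilities >= threshold."""
    tp = fp = fn = 0
    for p, y in zip(probs, labels):
        pred = p >= threshold
        if pred and y == 1:
            tp += 1
        elif pred:
            fp += 1
        elif y == 1:
            fn += 1
    return 2.0 * tp / (2 * tp + fp + fn) if tp else 0.0


def best_f1(probs, labels):
    """PR-guided operating point: (best F1, threshold) over all observed scores."""
    return max((f1_at(probs, labels, t), t) for t in sorted(set(probs)))


# Synthetic, imbalanced data: 1 positive in every 10 samples (assumption).
rng = random.Random(0)
labels = [1 if i % 10 == 0 else 0 for i in range(300)]
# The constant shift of -2.0 pushes raw scores toward the majority (negative) class.
logits = [rng.gauss(1.0 if y else -1.0, 1.0) - 2.0 for y in labels]

raw = [sigmoid(z) for z in logits]
a, b = fit_platt_grid(logits, labels)
cal = [sigmoid(a * z + b) for z in logits]

print("fitted slope/offset:", a, b)
print("F1 at 0.5      raw:", f1_at(raw, labels, 0.5), "calibrated:", f1_at(cal, labels, 0.5))
print("F1 at PR-opt   raw:", best_f1(raw, labels)[0], "calibrated:", best_f1(cal, labels)[0])
```

Because a calibration map with positive slope preserves the ranking of the scores, the best achievable F1 over all thresholds is the same before and after calibration in this sketch, whereas F1 at the fixed 0.5 threshold can change. This offers one possible intuition for why gains might shrink at a PR-guided threshold; it applies only to monotone calibrators and does not reproduce the study's experiments or statistics.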