We aim to quantitatively measure the practical usability of medical image segmentation models: to what extent, how often, and on which samples a model's predictions can be used and trusted. We first propose a measure, Correctness-Confidence Rank Correlation (CCRC), which captures how well the ranking of predictions by confidence estimate agrees with their ranking by correctness score. A high CCRC value means that a model's prediction confidences reliably indicate which samples' predictions are more likely to be correct. Because CCRC does not capture the actual correctness of predictions, it alone cannot indicate whether a model is both accurate and reliable enough to use in practice. We therefore propose a second method, Usable Region Estimate (URE), which quantifies both prediction correctness and the reliability of confidence assessments in a single estimate. URE provides concrete information on the extent to which a model's predictions are usable. In addition, the size of the usable region (UR) can be used to compare models: a model with a larger UR can be considered more usable and hence better. Experiments on six datasets validate that the proposed evaluation methods perform well, providing a concrete and concise measure of the practical usability of medical image segmentation models. Code is available at https://github.com/yizhezhang2000/ure.
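To make the idea of a correctness-confidence rank correlation concrete, the following is a minimal sketch and not the authors' implementation (see the repository linked above for that). It assumes the rank correlation is Spearman's rho, which the abstract does not specify, and that correctness is a per-sample score such as Dice. The function names `average_ranks` and `rank_correlation` and the example values are hypothetical.

```python
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the group while the next value ties with the current one.
        while (end + 1 < len(order)
               and values[order[end + 1]] == values[order[start]]):
            end += 1
        mean_rank = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


def rank_correlation(confidence: Sequence[float],
                     correctness: Sequence[float]) -> float:
    """Spearman's rho: Pearson correlation of the two rank vectors.

    Assumption: Spearman is used here only for illustration; the
    source text says "rank correlation" without naming a statistic.
    """
    n = len(confidence)
    if n != len(correctness) or n < 2:
        raise ValueError("need two equal-length sequences with >= 2 samples")
    rx = average_ranks(confidence)
    ry = average_ranks(correctness)
    mx = sum(rx) / n
    my = sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (vx * vy) ** 0.5


if __name__ == "__main__":
    # Hypothetical per-sample confidences and Dice scores.
    conf = [0.95, 0.80, 0.60, 0.40]
    dice = [0.92, 0.88, 0.70, 0.55]
    print(rank_correlation(conf, dice))
```

In this toy example the confidence ordering matches the correctness ordering exactly, so the correlation is at its maximum. As the abstract notes, such a score says nothing about whether the Dice values themselves are high, which is the gap URE is meant to address.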