Grading precancerous lesions on whole slide images is a challenging task: the continuous space of morphological phenotypes often makes clear-cut decisions between grades difficult, leading to low inter- and intra-rater agreement. A growing number of Artificial Intelligence (AI) algorithms are being developed to help pathologists perform and standardize their diagnoses. However, these models can render their predictions without accounting for the ambiguity of the classes and can fail without notice, which prevents their wider acceptance in a clinical context. In this paper, we propose a new score to measure the confidence of AI models in grading tasks. Our confidence score is specifically adapted to ordinal output variables, is versatile, and requires no extra training, additional inferences, or particular architecture changes. Comparison with other popular techniques such as Monte Carlo Dropout and deep ensembles shows that our method provides state-of-the-art results while being simpler, more versatile, and less computationally intensive. The score is also easily interpretable and consistent with the real-life hesitations of pathologists. We show that the score is capable of accurately identifying mispredicted slides and that accuracy for high-confidence decisions is significantly higher than for low-confidence decisions (gap in AUC of 17.1% on the test set). We believe that the proposed confidence score could be leveraged by pathologists directly in their workflow and assist them on difficult tasks such as grading precancerous lesions.