Medical tasks are prone to inter-rater variability due to multiple factors such as image quality, professional experience and training, or guideline clarity. Training deep learning networks with annotations from multiple raters is a common practice that mitigates the model's bias towards a single expert. Reliable models that generate calibrated outputs and reflect the inter-rater disagreement are key to the integration of artificial intelligence in clinical practice. Various methods exist to account for different expert labels. We focus on comparing three label fusion methods: STAPLE, averaging of the raters' segmentations, and random sampling of each rater's segmentation during training. Each label fusion method is studied using both the conventional training framework and the recently published SoftSeg framework, which limits information loss by treating segmentation as a regression task. Our results, across 10 data splits on two public datasets, indicate that SoftSeg models, regardless of the ground-truth fusion method, had better calibration and preservation of the inter-rater variability than their conventional counterparts, without impacting segmentation performance. Conventional models, i.e., models trained with a Dice loss, binary inputs, and a sigmoid/softmax final activation, were overconfident and underestimated the uncertainty associated with inter-rater variability. Conversely, fusing labels by averaging within the SoftSeg framework led to underconfident outputs and overestimation of the rater disagreement. In terms of segmentation performance, the best label fusion method differed between the two datasets, indicating that this parameter may be task-dependent. However, SoftSeg yielded segmentation performance systematically superior or equal to that of conventionally trained models, with the best calibration and preservation of the inter-rater variability.
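Two of the three fusion methods above are simple enough to illustrate directly. The following is a minimal NumPy sketch, using hypothetical toy rater masks: averaging produces a soft ground truth (per-voxel agreement fraction, as used by SoftSeg), while random sampling selects one rater's binary mask per training iteration. STAPLE is omitted here, as it relies on an EM-based consensus estimation typically obtained from a dedicated implementation (e.g., in SimpleITK).

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy example: binary 4x4 segmentations from three raters.
rater_masks = np.stack(
    [rng.integers(0, 2, size=(4, 4)) for _ in range(3)]
).astype(float)

def average_fusion(masks):
    """Soft ground truth: per-voxel mean across raters, values in [0, 1]."""
    return masks.mean(axis=0)

def random_sampling_fusion(masks, rng):
    """Binary ground truth: pick one rater's mask for this training iteration."""
    idx = rng.integers(0, masks.shape[0])
    return masks[idx]

soft_gt = average_fusion(rater_masks)          # fractional agreement map
sampled_gt = random_sampling_fusion(rater_masks, rng)  # one rater's mask
```

With three raters, the averaged ground truth takes values in {0, 1/3, 2/3, 1}; a conventional pipeline would binarize it, whereas SoftSeg trains on the soft values directly.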