Generalization is an important attribute of machine learning models, particularly for those that are to be deployed in a medical context, where unreliable predictions can have real-world consequences. While the failure of models to generalize across datasets is typically attributed to a mismatch in the data distributions, performance gaps are often a consequence of biases in the ``ground-truth'' label annotations. This is particularly important in the context of medical image segmentation of pathological structures (e.g. lesions), where the annotation process is much more subjective and affected by a number of underlying factors, including the annotation protocol, rater education/experience, and clinical aims, among others. In this paper, we show that modeling annotation biases, rather than ignoring them, offers a promising way of accounting for differences in annotation style across datasets. To this end, we propose a generalized conditioning framework to (1) learn and account for different annotation styles across multiple datasets using a single model, (2) identify similar annotation styles across different datasets in order to permit their effective aggregation, and (3) fine-tune a fully trained model to a new annotation style with just a few samples. Finally, we present an image-conditioning approach to model annotation styles that correlate with specific image features, potentially enabling detection biases to be more easily identified.
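To make the idea of a conditioning framework concrete, the following is a minimal, hypothetical sketch (not the authors' implementation) of conditioning a segmentation network on a learned per-dataset "annotation style" embedding, assuming a PyTorch setup. The module name, the FiLM-style feature modulation, and all hyperparameters are illustrative assumptions.

```python
import torch
import torch.nn as nn

class StyleConditionedSegNet(nn.Module):
    """Toy segmentation network conditioned on a dataset/annotation-style ID."""

    def __init__(self, num_styles: int, in_ch: int = 1, feat_ch: int = 32, emb_dim: int = 16):
        super().__init__()
        # One learnable embedding per dataset / annotation style.
        self.style_emb = nn.Embedding(num_styles, emb_dim)
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(),
        )
        # FiLM-style modulation: predict a per-channel scale and shift from the style embedding.
        self.film = nn.Linear(emb_dim, 2 * feat_ch)
        self.head = nn.Conv2d(feat_ch, 1, 1)  # binary lesion-mask logits

    def forward(self, x: torch.Tensor, style_id: torch.Tensor) -> torch.Tensor:
        h = self.encoder(x)
        gamma, beta = self.film(self.style_emb(style_id)).chunk(2, dim=-1)
        h = gamma[:, :, None, None] * h + beta[:, :, None, None]
        return self.head(h)

# Usage: the same image segmented under two different annotation styles.
model = StyleConditionedSegNet(num_styles=3)
img = torch.randn(1, 1, 64, 64)
mask_style0 = model(img, torch.tensor([0]))
mask_style2 = model(img, torch.tensor([2]))
```

Under this kind of setup, comparing predictions produced under different style IDs for the same image is one way similar annotation styles across datasets could be identified, and fine-tuning to a new style could amount to learning only a new embedding.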