Generalization is an important attribute of machine learning models, particularly those deployed in a medical context, where unreliable predictions can have real-world consequences. While the failure of models to generalize across datasets is typically attributed to a mismatch in the data distributions, performance gaps are often a consequence of biases in the 'ground-truth' label annotations. This is particularly important in the context of medical image segmentation of pathological structures (e.g. lesions), where the annotation process is much more subjective and is affected by a number of underlying factors, including the annotation protocol, rater education/experience, and clinical aims, among others. In this paper, we show that modeling annotation biases, rather than ignoring them, offers a promising way of accounting for differences in annotation style across datasets. To this end, we propose a generalized conditioning framework to (1) learn and account for different annotation styles across multiple datasets using a single model, (2) identify similar annotation styles across different datasets in order to permit their effective aggregation, and (3) fine-tune a fully trained model to a new annotation style with just a few samples. Next, we present an image-conditioning approach to model annotation styles that correlate with specific image features, potentially enabling detection biases to be more easily identified.
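The abstract does not say how the conditioning is implemented. As a rough illustration only, the sketch below assumes each annotation style owns a per-channel scale and shift applied to instance-normalized features. The class `StyleConditionedNorm` and its methods are hypothetical names introduced here and are not taken from the paper. The three methods loosely mirror the three goals above: conditioning on a style, registering a new style for few-sample fine-tuning, and comparing the learned parameters of two styles.

```python
from statistics import fmean, pvariance


def instance_norm(channels, eps=1e-5):
    """Normalize each channel (a flat list of values) to zero mean, unit variance."""
    out = []
    for values in channels:
        mu = fmean(values)
        var = pvariance(values, mu)
        out.append([(v - mu) / (var + eps) ** 0.5 for v in values])
    return out


class StyleConditionedNorm:
    """Hypothetical layer: one (scale, shift) pair per annotation style.

    Assumption: styles are modeled as per-channel affine parameters; the
    source abstract does not specify the actual mechanism.
    """

    def __init__(self, num_styles, num_channels):
        self.num_channels = num_channels
        # Identity initialization: scale 1, shift 0 for every style.
        self.gamma = [[1.0] * num_channels for _ in range(num_styles)]
        self.beta = [[0.0] * num_channels for _ in range(num_styles)]

    def __call__(self, channels, style_id):
        # Shared normalization, then the scale/shift of the requested style.
        normed = instance_norm(channels)
        return [
            [g * v + b for v in values]
            for values, g, b in zip(normed, self.gamma[style_id], self.beta[style_id])
        ]

    def add_style(self):
        # A new style gets fresh parameters; in a fine-tuning setup only
        # these would be trained while the rest of the network stays frozen.
        self.gamma.append([1.0] * self.num_channels)
        self.beta.append([0.0] * self.num_channels)
        return len(self.gamma) - 1

    def style_distance(self, a, b):
        # Euclidean distance between the parameter vectors of two styles;
        # a small distance hints that the two styles may be aggregated.
        pa = self.gamma[a] + self.beta[a]
        pb = self.gamma[b] + self.beta[b]
        return sum((p - q) ** 2 for p, q in zip(pa, pb)) ** 0.5
```

In this sketch the same features yield different outputs depending only on `style_id`, which is how a single model could produce segmentations in several annotation styles.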