Deep learning failure cases are abundant, particularly in the medical area. Recent studies on out-of-distribution generalization have advanced considerably on well-controlled synthetic datasets, but these do not represent medical imaging contexts. We propose a pipeline that relies on artifact annotations to enable generalization evaluation and debiasing for the challenging context of skin lesion analysis. First, we partition the data into increasingly biased training and test sets, enabling a better assessment of generalization. Then, we create environments based on skin lesion artifacts so that domain generalization methods can be applied. Finally, after robust training, we perform a test-time debiasing procedure that reduces spurious features in inference images. Our experiments show that our pipeline improves performance metrics in biased cases and avoids artifacts when explanation methods are used. Still, when evaluated on out-of-distribution data, the models did not prefer clinically meaningful features. Instead, performance improved only on test sets presenting artifacts similar to those in training, suggesting that the models learned to ignore the known set of artifacts. Our results raise the concern that debiasing models with respect to a single aspect may not be enough for fair skin lesion analysis.
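The first step above, partitioning data into increasingly biased training and test sets from artifact annotations, can be sketched as follows. This is a minimal illustration, not the paper's actual protocol: the function name `biased_split`, the tuple layout `(image_id, label, artifact)`, and the single binary artifact are all assumptions made for the example. A `bias` of 1.0 forces the artifact to co-occur with the positive label in training, so the leftover test set holds the anti-correlated samples.

```python
import random

# Hypothetical sketch of a bias-controlled split. Each sample is a tuple
# (image_id, label, artifact), with label 1 = malignant and artifact a
# binary annotation (e.g., ruler or ink marking present).

def biased_split(samples, bias, train_frac=0.8, seed=0):
    """Build a training set where the artifact co-occurs with the positive
    label with probability `bias`; the remaining samples form the test set,
    which is therefore depleted of label-artifact-correlated cases."""
    rng = random.Random(seed)
    # Bucket samples by their (label, artifact) combination.
    groups = {(l, a): [] for l in (0, 1) for a in (0, 1)}
    for sample in samples:
        groups[(sample[1], sample[2])].append(sample)
    for g in groups.values():
        rng.shuffle(g)
    n_train = int(train_frac * len(samples))
    train = []
    # Alternate labels; with probability `bias`, draw the artifact value
    # that matches the label (spurious correlation).
    for i in range(n_train):
        label = i % 2
        artifact = label if rng.random() < bias else 1 - label
        # Fall back to the other artifact value if that bucket is empty.
        bucket = groups[(label, artifact)] or groups[(label, 1 - artifact)]
        if not bucket:
            break
        train.append(bucket.pop())
    test = [s for g in groups.values() for s in g]
    return train, test
```

Sweeping `bias` from 0.5 (no correlation) toward 1.0 yields the progressively harder evaluation settings described above: a model that exploits the artifact does well on the biased training distribution but degrades on the anti-correlated test split.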