As machine learning (ML) models gain traction in clinical applications, understanding the impact of clinician and societal biases on ML models is increasingly important. While biases can arise in the labels used for model training, the many sources from which these biases arise are not yet well-studied. In this paper, we highlight disparate censorship (i.e., differences in testing rates across patient groups) as a source of label bias that clinical ML models may amplify, potentially causing harm. Many patient risk-stratification models are trained using the results of clinician-ordered diagnostic and laboratory tests as labels. Patients without test results are often assigned a negative label, which assumes that untested patients do not experience the outcome. Since test orders are affected by clinical and resource considerations, testing may not be uniform across patient populations, giving rise to disparate censorship. Disparate censorship among patients of equivalent risk leads to undertesting in certain groups and, in turn, more biased labels for those groups. Using such biased labels in standard ML pipelines could contribute to gaps in model performance across patient groups. Here, we theoretically and empirically characterize the conditions under which disparate censorship or undertesting affects model performance across subgroups. Our findings call attention to disparate censorship as a source of label bias in clinical ML models.
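To make the labeling mechanism described above concrete, the following is a minimal simulation sketch (illustrative only; the group sizes, prevalence, and testing rates are assumptions, not values from the paper). It shows how unequal testing rates across two groups of equal true risk produce group-dependent label bias when untested patients are assigned a negative label.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Two patient groups of equivalent underlying risk (10% true prevalence in both).
group = rng.integers(0, 2, size=n)
y_true = rng.binomial(1, 0.10, size=n)

# Disparate censorship: group 1 is tested less often than group 0.
# These testing rates are illustrative assumptions.
test_rate = np.where(group == 0, 0.8, 0.4)
tested = rng.binomial(1, test_rate)

# Untested patients receive a negative label.
y_obs = y_true * tested

# The observed prevalence understates the true prevalence more in the
# undertested group, i.e., its labels are more biased.
for g in (0, 1):
    mask = group == g
    print(f"group {g}: true prevalence {y_true[mask].mean():.3f}, "
          f"observed prevalence {y_obs[mask].mean():.3f}")
```

In this toy setting both groups share the same true prevalence, yet the observed labels are depressed more for the undertested group; a standard ML pipeline trained on `y_obs` would learn from that group-dependent label bias.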