Recent work has shown that fine-tuning large networks is surprisingly sensitive to changes in random seed(s). We explore the implications of this phenomenon for model fairness across demographic groups in clinical prediction tasks over electronic health records (EHR) in MIMIC-III -- the standard dataset in clinical NLP research. Apparent subgroup performance varies substantially for seeds that yield similar overall performance, although there is no evidence of a trade-off between overall and subgroup performance. However, we also find that the small sample sizes inherent to looking at intersections of minority groups and somewhat rare conditions limit our ability to accurately estimate disparities. Further, we find that jointly optimizing for high overall performance and low disparities does not yield statistically significant improvements. Our results suggest that fairness work using MIMIC-III should carefully account for variations in apparent differences that may arise from stochasticity and small sample sizes.
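The abstract's mention of "jointly optimizing for high overall performance and low disparities" can be read as a fairness-regularized training objective. Below is a minimal, hedged sketch of one such objective, not the paper's actual implementation: a standard cross-entropy term plus a penalty on the spread of per-group losses. The weight `lambda_fair`, the group encoding, and the toy data are illustrative assumptions.

```python
# Hedged sketch (not the paper's method): jointly optimizing overall performance
# and low subgroup disparity as a single training loss in PyTorch.
import torch
import torch.nn.functional as F

def joint_loss(logits, labels, group_ids, lambda_fair=1.0):
    """Cross-entropy plus a penalty on the spread of per-group mean losses."""
    per_example = F.cross_entropy(logits, labels, reduction="none")
    overall = per_example.mean()

    # Mean loss within each demographic group.
    group_means = torch.stack(
        [per_example[group_ids == g].mean() for g in torch.unique(group_ids)]
    )

    # Disparity term: variance of group-wise losses around their mean.
    disparity = ((group_means - group_means.mean()) ** 2).mean()
    return overall + lambda_fair * disparity

# Toy usage with random data (2 demographic groups, binary task).
logits = torch.randn(8, 2, requires_grad=True)
labels = torch.randint(0, 2, (8,))
groups = torch.randint(0, 2, (8,))
loss = joint_loss(logits, labels, groups)
loss.backward()
```

Under this kind of objective, the paper's finding is that adding the disparity term did not yield statistically significant improvements over optimizing overall performance alone.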