Clinical machine learning models experience significantly degraded performance on datasets not seen during training, e.g., new hospitals or populations. Recent developments in domain generalization offer a promising solution to this problem by creating models that learn invariances across environments. In this work, we benchmark the performance of eight domain generalization methods on multi-site clinical time series and medical imaging data. We introduce a framework to induce synthetic but realistic domain shifts and sampling bias to stress-test these methods beyond existing non-healthcare benchmarks. We find that current domain generalization methods do not consistently achieve significant gains in out-of-distribution performance over empirical risk minimization on real-world medical imaging data, in line with prior work on general imaging datasets. However, a subset of realistic induced-shift scenarios in clinical time series data does exhibit limited performance gains. We characterize these scenarios in detail, and recommend best practices for domain generalization in the clinical setting.
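
To make the evaluation setting concrete, the following is a minimal sketch of a leave-one-environment-out comparison point: an empirical risk minimization (ERM) baseline is trained on the pooled training environments and scored on the held-out one. It is not the benchmark code of this work. The function names, the site names, the one-dimensional synthetic features, and the simple mean shift used to mimic a domain shift are all assumptions made for illustration.

```python
import math
import random


def make_environment(n, shift, rng):
    """Hypothetical synthetic environment: one feature whose mean depends on
    the label, plus an environment-specific mean shift (an assumed, simplified
    stand-in for a domain shift)."""
    data = []
    for _ in range(n):
        y = rng.randint(0, 1)
        x = rng.gauss(2.0 * y + shift, 1.0)
        data.append((x, y))
    return data


def fit_erm(data, lr=0.1, steps=300):
    """ERM baseline: logistic regression fitted by full-batch gradient descent
    on the pooled training data, ignoring which environment each point is from."""
    w, b = 0.0, 0.0
    n = len(data)
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in data:
            p = 1.0 / (1.0 + math.exp(-(w * x + b)))
            grad_w += (p - y) * x
            grad_b += p - y
        w -= lr * grad_w / n
        b -= lr * grad_b / n
    return w, b


def accuracy(model, data):
    """Fraction of points whose thresholded prediction matches the label."""
    w, b = model
    correct = sum(1 for x, y in data if (1 if w * x + b > 0 else 0) == y)
    return correct / len(data)


def leave_one_environment_out(environments):
    """Hold out each environment in turn, train on the rest, and report
    accuracy on the held-out (out-of-distribution) environment."""
    results = {}
    for held_out in environments:
        train = [point
                 for name, env in environments.items() if name != held_out
                 for point in env]
        results[held_out] = accuracy(fit_erm(train), environments[held_out])
    return results


# Usage: three hypothetical sites, the last with a larger induced shift.
rng = random.Random(0)
environments = {
    "site_a": make_environment(200, 0.0, rng),
    "site_b": make_environment(200, 0.3, rng),
    "site_c": make_environment(200, 1.5, rng),
}
results = leave_one_environment_out(environments)
for name, acc in results.items():
    print(f"held-out {name}: accuracy {acc:.3f}")
```

A domain generalization method would replace `fit_erm` in this loop while the held-out protocol stays the same, which is what allows a like-for-like comparison against ERM.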