Heterogeneity in medical data, e.g., from data collected at different sites and with different protocols in a clinical study, is a fundamental hurdle for accurate prediction with machine learning models, which often fail to generalize across such differences. This paper leverages a recently proposed normalizing-flow-based method to perform counterfactual inference on a structural causal model (SCM) in order to harmonize such data. The causal model describes observed effects (brain magnetic resonance imaging data) as resulting from known confounders (site, gender and age) and exogenous noise variables. Our formulation exploits the bijection induced by the flow for the purpose of harmonization: we infer the posterior of the exogenous variables, intervene on observations, and draw samples from the resulting SCM to obtain counterfactuals. The approach is evaluated extensively on multiple large, real-world medical datasets and displays better cross-domain generalization than state-of-the-art algorithms. We also provide further experiments that use regression and classification tasks to evaluate the quality of the confounder-independent data generated by our model.
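The three counterfactual steps named in the abstract (infer the exogenous noise, intervene, re-generate) can be illustrated with a deliberately simplified sketch. This is not the paper's model: it assumes a single scalar measurement and replaces the learned normalizing flow with a hand-written conditional affine bijection. The parameters `SITE_SHIFT`, `SITE_SCALE`, and `AGE_SLOPE` and the function names are hypothetical, chosen only for illustration. Because the map is an exact bijection, the posterior over the exogenous noise collapses to a single value in this toy setting.

```python
import numpy as np

# Hypothetical toy parameters (assumptions, not from the paper):
# each site shifts and rescales the measurement; age has a linear effect.
SITE_SHIFT = np.array([0.0, 2.0, -1.0])
SITE_SCALE = np.array([1.0, 1.5, 0.5])
AGE_SLOPE = -0.05


def forward(eps, site, age):
    """Conditional affine bijection: exogenous noise -> observation."""
    return SITE_SHIFT[site] + AGE_SLOPE * age + SITE_SCALE[site] * eps


def inverse(x, site, age):
    """Inverse of the bijection: observation -> exogenous noise."""
    return (x - SITE_SHIFT[site] - AGE_SLOPE * age) / SITE_SCALE[site]


def harmonize(x, site, age, target_site=0):
    """Map every observation to its counterfactual at `target_site`."""
    eps = inverse(x, site, age)                 # 1. abduction: recover the noise
    cf_site = np.full_like(site, target_site)   # 2. action: do(site = target_site)
    return forward(eps, cf_site, age)           # 3. prediction: re-generate


# Simulated data from three sites.
rng = np.random.default_rng(0)
n = 1000
site = rng.integers(0, 3, size=n)
age = rng.uniform(20, 80, size=n)
eps = rng.standard_normal(n)
x = forward(eps, site, age)

x_harmonized = harmonize(x, site, age, target_site=0)
```

In this sketch, harmonized values differ from what the reference site would have produced only through the retained noise and age terms, so subject-specific variation is kept while the site effect is removed. A learned flow would play the role of `forward` and `inverse`, with the conditioning on confounders estimated from data rather than fixed by hand.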