Pooling publicly available MRI data from multiple sites makes it possible to assemble large groups of subjects, increase statistical power, and promote data reuse with machine learning techniques. Harmonization of multicenter data is necessary to reduce the confounding effect of non-biological sources of variability in the data. However, when harmonization is applied to the entire dataset before machine learning, it causes data leakage: information from outside the training set may affect model building, which can lead to overestimated performance. We propose 1) a measurement of the efficacy of data harmonization and 2) a harmonizer transformer, i.e., an implementation of ComBat harmonization that can be encapsulated among the preprocessing steps of a machine learning pipeline, thereby avoiding data leakage. We tested these tools using brain T1-weighted MRI data from 1740 healthy subjects acquired at 36 sites. After harmonization, the site effect was removed or reduced. We also measured the data leakage effect in predicting individual age from MRI data, showing that introducing the harmonizer transformer into a machine learning pipeline avoids data leakage.
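The idea of encapsulating harmonization inside a pipeline can be illustrated with a minimal sketch. It is not the harmonizer transformer proposed here and it is not ComBat: the hypothetical `SiteStandardizer` below is a simplified per-site location/scale adjustment with no empirical Bayes step and no preservation of biological covariates. It assumes that column 0 of the feature matrix holds an integer site label, uses synthetic data, and requires every site in a test fold to have been seen during fitting. What it does show is the key point: because the harmonization parameters are estimated in `fit`, cross-validation re-estimates them on each training fold only.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.linear_model import Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline


class SiteStandardizer(BaseEstimator, TransformerMixin):
    """Hypothetical, simplified stand-in for ComBat (location/scale only).

    Assumption: column 0 of X is an integer site label; it is dropped on output.
    """

    def fit(self, X, y=None):
        X = np.asarray(X, dtype=float)
        sites, feats = X[:, 0], X[:, 1:]
        # Pooled statistics that every site is mapped onto.
        self.grand_mean_ = feats.mean(axis=0)
        self.grand_std_ = feats.std(axis=0) + 1e-12
        # Per-site statistics, estimated from the data passed to fit only.
        self.site_stats_ = {
            s: (feats[sites == s].mean(axis=0),
                feats[sites == s].std(axis=0) + 1e-12)
            for s in np.unique(sites)
        }
        return self

    def transform(self, X):
        X = np.asarray(X, dtype=float)
        sites, feats = X[:, 0], X[:, 1:]
        out = np.empty_like(feats)
        for s in np.unique(sites):
            # Raises KeyError for a site that was not present during fit.
            mean, std = self.site_stats_[s]
            mask = sites == s
            out[mask] = (feats[mask] - mean) / std * self.grand_std_ + self.grand_mean_
        return out


# Synthetic data: 3 sites, 4 age-related features with additive site offsets.
rng = np.random.default_rng(0)
n = 300
site = rng.integers(0, 3, size=n)
age = rng.uniform(20, 80, size=n)
feats = (age[:, None] * np.array([0.5, -0.3, 0.2, 0.1])
         + rng.normal(size=(n, 4))
         + site[:, None] * 5.0)
X = np.column_stack([site, feats])

# Harmonization is a pipeline step, so it is refit on each training fold.
pipe = make_pipeline(SiteStandardizer(), Ridge())
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X, age, cv=cv, scoring="neg_mean_absolute_error")
print("Cross-validated MAE (years):", -scores.mean())
```

Calling `SiteStandardizer().fit_transform(X)` on the full dataset before cross-validation would instead let test-fold subjects contribute to the site statistics, which is the leakage scenario described above.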