An increasing amount of research is being devoted to applying machine learning methods to electronic health record (EHR) data for various clinical tasks. This growing area of research has exposed the limited accessibility of EHR datasets and the limited reproducibility of different modeling frameworks. One reason for these limitations is the lack of standardized pre-processing pipelines. MIMIC is a freely available EHR dataset, distributed in a raw format, that has been used in numerous studies. The absence of standardized pre-processing steps is a significant barrier to wider adoption of the dataset. It also leads to different cohorts being used in downstream tasks, which limits the ability to compare results across similar studies. Different studies also report different performance metrics, which further reduces the ability to compare model results. In this work, we provide an end-to-end, fully customizable pipeline to extract, clean, and pre-process data, and to make and evaluate predictions on the fourth version of the MIMIC dataset (MIMIC-IV) for ICU and non-ICU clinical time-series prediction tasks. The tool is publicly available at https://github.com/healthylaife/MIMIC-IV-Data-Pipeline.