A growing body of research applies machine learning methods to electronic health record (EHR) data for a variety of clinical tasks. This work has exposed two limitations: EHR datasets are not readily accessible to the broader community, and results from different modeling frameworks are difficult to reproduce. One reason for both is the lack of standardized pre-processing pipelines. MIMIC is a freely available EHR dataset, distributed in a raw format, that has been used in numerous studies. The absence of standardized pre-processing steps is a major barrier to wider adoption of the dataset. It also leads to different cohorts being used in downstream tasks, which limits the ability to compare results across similar studies. Studies also report different performance metrics, which further reduces the comparability of model results. In this work, we provide an end-to-end, fully customizable pipeline to extract, clean, and pre-process data from the fourth version of the MIMIC dataset (MIMIC-IV), and to run and evaluate predictions on it, for ICU and non-ICU clinical time-series prediction tasks.