Pre-training has proven successful in several areas of machine learning, such as Computer Vision (CV), Natural Language Processing (NLP), and medical imaging. However, it has not been fully explored for clinical data analysis. Although an immense amount of Electronic Health Record (EHR) data is recorded, data and labels can be scarce when the data is collected in small hospitals or concerns rare diseases. In such scenarios, pre-training on a larger set of EHR data could improve model performance. In this paper, we apply unsupervised pre-training to heterogeneous, multi-modal EHR data for patient outcome prediction. To model this data, we leverage graph deep learning over population graphs. We first design a network architecture based on a graph transformer that handles the various input feature types occurring in EHR data, such as continuous, discrete, and time-series features, allowing better multi-modal data fusion. We then design pre-training methods based on masked imputation to pre-train our network before fine-tuning on different end tasks. Pre-training is fully unsupervised, which lays the groundwork for future pre-training on large public datasets with different tasks and similar modalities. We test our method on two medical datasets of patient records, TADPOLE and MIMIC-III, which include imaging and non-imaging features and different prediction tasks. We find that our proposed graph-based pre-training method helps in modeling the data at the population level and further improves fine-tuning performance, increasing AUC on average by 4.15% for MIMIC-III and 7.64% for TADPOLE.
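To make the idea of masked-imputation pre-training over a population graph concrete, the following is a minimal NumPy sketch, not the method used in this paper. Everything in it is an illustrative assumption: the function names (`knn_adjacency`, `mask_features`, `impute_from_neighbours`, `masked_imputation_loss`), the k-nearest-neighbour graph construction, the 30% mask ratio, and the use of a simple neighbour average as a stand-in for the graph transformer. It only shows the general shape of the objective: hide some feature values, predict them from neighbouring patients, and score the prediction on the hidden entries alone.

```python
import numpy as np


def knn_adjacency(x, k):
    """Row-normalised k-nearest-neighbour adjacency over patients (no self-loops)."""
    dist = np.linalg.norm(x[:, None, :] - x[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)           # a patient is never its own neighbour
    nearest = np.argsort(dist, axis=1)[:, :k]
    adj = np.zeros_like(dist)
    adj[np.arange(len(x))[:, None], nearest] = 1.0
    return adj / adj.sum(axis=1, keepdims=True)


def mask_features(x, mask_ratio, rng):
    """Hide a random subset of feature entries; masked entries are set to 0."""
    mask = rng.random(x.shape) < mask_ratio
    return np.where(mask, 0.0, x), mask


def impute_from_neighbours(x_masked, mask, adj, eps=1e-8):
    """Stand-in for a graph network: average each feature over observed neighbours."""
    observed = (~mask).astype(float)
    weight = adj @ observed                   # share of neighbours with the value observed
    return (adj @ x_masked) / np.maximum(weight, eps)


def masked_imputation_loss(pred, target, mask):
    """Mean squared error computed only on the entries that were hidden."""
    if not mask.any():
        return 0.0
    return float(np.mean((pred[mask] - target[mask]) ** 2))


# Toy population: 20 patients with 5 continuous features (synthetic data).
rng = np.random.default_rng(0)
features = rng.normal(size=(20, 5))

masked, mask = mask_features(features, mask_ratio=0.3, rng=rng)
adjacency = knn_adjacency(masked, k=4)        # graph built from the masked input only
prediction = impute_from_neighbours(masked, mask, adjacency)
loss = masked_imputation_loss(prediction, features, mask)
print(f"masked-imputation loss: {loss:.4f}")
```

In an actual pre-training setup, the neighbour average would be replaced by a trainable network and this loss would be minimised by gradient descent before fine-tuning on the outcome labels. Discrete and time-series features would also need their own encoders and loss terms, which this sketch omits.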