Real-world Electronic Health Records (EHRs) are often plagued by a high rate of missing data. In our EHRs, for example, the missing rate reaches 90% for some features and averages around 70% across all features. We propose a Time-Aware Dual-Cross-Visit missing value imputation method, named TA-DualCV, which jointly leverages multivariate dependencies across features and longitudinal dependencies both within and across visits to maximize the information extracted from the limited observable records in EHRs. Specifically, TA-DualCV captures the latent structure of missing patterns across measurements of different features. It also accounts for time continuity and captures latent temporal missing patterns based on both time steps and irregular time intervals. TA-DualCV is evaluated on three large real-world EHRs with two types of tasks: an unsupervised imputation task with mask rates varied up to 90%, and a supervised 24-hour early prediction of septic shock using Long Short-Term Memory (LSTM). Our results show that TA-DualCV performs significantly better than all of the existing state-of-the-art imputation baselines, such as DETROIT and TAME, on both types of tasks.
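The unsupervised imputation task described above can be illustrated with a minimal sketch, under stated assumptions. The function names `mask_observed`, `mean_impute`, and `masked_rmse` are hypothetical, the data are synthetic, and the per-feature mean imputer is only a stand-in for whichever imputation method is being evaluated; it is not TA-DualCV, and RMSE is an assumed metric. The idea is to hide a chosen fraction of the observed entries, impute them, and score the imputer only on the hidden entries.

```python
import numpy as np


def mask_observed(x, mask_rate, rng):
    """Hide a fraction of the observed (non-NaN) entries of x.

    Returns the masked copy and a boolean array marking the hidden entries.
    """
    observed = ~np.isnan(x)
    obs_idx = np.flatnonzero(observed)
    n_hide = int(round(mask_rate * obs_idx.size))
    hide_idx = rng.choice(obs_idx, size=n_hide, replace=False)
    hidden = np.zeros(x.shape, dtype=bool)
    hidden.flat[hide_idx] = True
    x_masked = x.copy()
    x_masked[hidden] = np.nan
    return x_masked, hidden


def mean_impute(x):
    """Stand-in imputer (assumption): fill each column with its observed mean."""
    observed = ~np.isnan(x)
    counts = observed.sum(axis=0)
    sums = np.where(observed, x, 0.0).sum(axis=0)
    # Columns with no observed value fall back to 0.0.
    col_mean = np.divide(sums, counts, out=np.zeros_like(sums), where=counts > 0)
    return np.where(observed, x, col_mean)


def masked_rmse(x_true, x_imputed, hidden):
    """Root-mean-square error computed only on the hidden entries."""
    diff = x_true[hidden] - x_imputed[hidden]
    return float(np.sqrt(np.mean(diff ** 2)))


# Synthetic records (rows) by features (columns), with about 70% missing.
rng = np.random.default_rng(0)
records = rng.normal(size=(200, 5))
records[rng.random(records.shape) < 0.7] = np.nan

for rate in (0.1, 0.5, 0.9):
    masked, hidden = mask_observed(records, rate, rng)
    score = masked_rmse(records, mean_impute(masked), hidden)
    print(f"mask rate {rate:.0%}: RMSE on hidden entries = {score:.3f}")
```

Because only entries that were actually observed are hidden, the ground truth for every scored entry is known, which is what makes the comparison across mask rates meaningful.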