Medical data sets are usually corrupted by noise and missing data. Missingness patterns are commonly assumed to be completely random, but in medical settings they tend to occur in bursts, for example because sensors are switched off for some time or because data are collected in a misaligned, uneven fashion. This paper proposes to model medical data records with heterogeneous data types and bursty missing data using sequential variational autoencoders (VAEs). In particular, we propose a new methodology, the Shi-VAE, which extends the capabilities of VAEs to sequential streams of data with missing observations. We compare our model against state-of-the-art solutions on an intensive care unit (ICU) database and a dataset of passive human monitoring. Furthermore, we find that standard error metrics such as RMSE are not conclusive enough to assess temporal models, so we include in our analysis the cross-correlation between the ground truth and the imputed signal. We show that Shi-VAE achieves the best performance on both metrics, with lower computational complexity than the GP-VAE model, which is the state-of-the-art method for medical records.