Multilevel Stochastic Optimization for Imputation in Massive Medical Data Records

The exploration and analysis of massive datasets have recently attracted growing interest in the research and development communities. It has long been recognized that many datasets contain significant levels of missing numerical data. We introduce a mathematically principled stochastic optimization imputation method based on the theory of Kriging, which is shown to be a powerful method for imputation. However, the computational effort and potential numerical instabilities of Kriging make its predictions costly and/or unreliable, potentially limiting its use on large-scale datasets. In this paper, we apply a recently developed multilevel stochastic optimization approach to the problem of imputation in massive medical records. The approach is based on computational applied mathematics techniques and is highly accurate. In particular, for the Best Linear Unbiased Predictor (BLUP) this multilevel formulation is exact, and it is also significantly faster and more numerically stable. This permits practical application of Kriging methods to data imputation problems for massive datasets. We test this approach on the National Inpatient Sample (NIS) data records of the Healthcare Cost and Utilization Project (HCUP), Agency for Healthcare Research and Quality. Numerical results show that the multilevel method significantly outperforms current approaches and is numerically robust. In particular, it is more accurate than the methods recommended in the recent HCUP report on the important problem of missing data, a problem that could lead to sub-optimal and poorly founded funding policy decisions. In comparative benchmark tests, the multilevel stochastic method is significantly superior to the methods recommended in the report, including Predictive Mean Matching (PMM) and Predicted Posterior Distribution (PPD), with error reductions of up to 75%.
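
To make the Kriging idea above concrete, the following is a minimal sketch of BLUP-based imputation with a constant unknown mean, solved directly with dense linear algebra. It is not the multilevel formulation applied in this paper; it only illustrates the baseline predictor whose cost and conditioning motivate that formulation. The function names, the squared-exponential covariance, and the `length_scale` and `nugget` parameters are assumptions chosen for illustration.

```python
import numpy as np

def gaussian_cov(a, b, length_scale=1.0):
    # Squared-exponential covariance between rows of a and rows of b
    # (an assumed covariance model, chosen only for illustration).
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-0.5 * d2 / length_scale**2)

def blup_impute(x_obs, y_obs, x_miss, length_scale=1.0, nugget=1e-8):
    # Direct (dense) BLUP with a constant unknown mean.
    # The O(n^3) solves below are the cost that limits plain Kriging
    # on large datasets; the small nugget guards against ill-conditioning.
    n = len(x_obs)
    C = gaussian_cov(x_obs, x_obs, length_scale) + nugget * np.eye(n)
    c0 = gaussian_cov(x_miss, x_obs, length_scale)
    ones = np.ones(n)
    # Generalized least-squares estimate of the constant mean.
    beta = (ones @ np.linalg.solve(C, y_obs)) / (ones @ np.linalg.solve(C, ones))
    # Mean plus covariance-weighted correction from the observed residuals.
    return beta + c0 @ np.linalg.solve(C, y_obs - beta * ones)

# Hypothetical toy data: one feature, four observed records, one missing value.
x_obs = np.array([[0.0], [1.0], [2.0], [3.0]])
y_obs = np.array([1.0, 2.0, 0.5, 3.0])
x_miss = np.array([[1.5]])
imputed = blup_impute(x_obs, y_obs, x_miss)
```

With a very small nugget the predictor interpolates the observed values, and when all observed values are equal it returns that common value. Both properties follow from the BLUP formula and make convenient sanity checks.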
