In cohort studies of pollution and health, pollutant concentrations must be predicted accurately at new locations, because fixed monitoring sites and study participants are often spatially misaligned. For multi-pollutant data, principal component analysis (PCA) is often applied before spatial prediction to obtain a low-rank (LR) structure of the data. The recently developed predictive PCA modifies the traditional algorithm to improve overall predictive performance by leveraging both the LR and the spatial structure of the data. However, predictive PCA requires complete data or an initial imputation step. Nonparametric imputation techniques that do not account for spatial information may distort the underlying structure of the data and thus further reduce predictive performance. We propose a convex optimization problem inspired by the LR matrix completion framework and develop a proximal algorithm to solve it. Missing data are imputed within the algorithm itself, which eliminates the need for a separate imputation step. We show that our algorithm has a low computational burden and maintains reliable predictive performance as the severity of missing data increases.
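To make the idea of a proximal algorithm for LR matrix completion concrete, the following is a minimal Python sketch of the generic technique only. It is not the formulation proposed in the abstract: it assumes a plain nuclear-norm penalty with no spatial term, and the names `svt`, `complete_low_rank`, `lam`, `step`, `n_iter`, and `tol` are hypothetical choices made for illustration. It minimizes 0.5 * ||mask * (X - Y)||_F^2 + lam * ||X||_* by proximal gradient descent, so missing entries are filled in as part of the iteration rather than in a separate imputation step.

```python
import numpy as np


def svt(Z, tau):
    """Soft-threshold the singular values of Z by tau.

    This is the proximal operator of tau * (nuclear norm).
    """
    U, s, Vt = np.linalg.svd(Z, full_matrices=False)
    s = np.maximum(s - tau, 0.0)
    return (U * s) @ Vt


def complete_low_rank(Y, mask, lam=1.0, step=1.0, n_iter=200, tol=1e-6):
    """Generic nuclear-norm matrix completion by proximal gradient.

    Y    : 2-D array of observations; entries where mask is False are ignored
           and may be NaN.
    mask : boolean array of the same shape, True where Y is observed.
    lam  : weight of the nuclear-norm penalty (assumed tuning parameter).
    step : step size; the smooth term has a 1-Lipschitz gradient, so any
           value in (0, 1] keeps the objective non-increasing.
    """
    Y0 = np.where(mask, Y, 0.0)      # zero out missing entries (removes NaN)
    X = np.zeros_like(Y0)
    for _ in range(n_iter):
        grad = mask * (X - Y0)       # gradient of the observed-entry loss
        X_new = svt(X - step * grad, step * lam)
        converged = np.linalg.norm(X_new - X) <= tol * max(1.0, np.linalg.norm(X))
        X = X_new
        if converged:
            break
    return X
```

In this sketch, `complete_low_rank(Y, mask, lam=0.1)` returns a completed matrix whose entries at unobserved positions serve as the imputed values. A method that also exploits spatial structure, as the abstract describes, would need additional terms in the objective that are not shown here.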