Nonparametric regression imputation is commonly used in missing data analysis, but it suffers from the ``curse of dimensionality''. The explosive growth of sample sizes in the era of big data can alleviate this problem, but large-scale data pose challenges for data storage and for the computation of estimators. These challenges make the classical nonparametric regression imputation methods no longer applicable, which motivates us to develop two distributed nonparametric regression imputation methods: one based on kernel smoothing and the other on the sieve method. The kernel-based distributed imputation method has extremely low communication cost, and the sieve-based distributed imputation method can accommodate more local machines. To illustrate the proposed imputation methods, we consider estimation of the response mean. We propose two distributed nonparametric regression imputation estimators of the response mean and prove that they are asymptotically normal, with asymptotic variances achieving the semiparametric efficiency bound. The proposed methods are evaluated through simulation studies and illustrated by a real data analysis.
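To make the idea of distributed regression imputation for the response mean concrete, the following is a minimal sketch and not the estimator analyzed in the paper. It assumes a simple divide-and-conquer scheme: each local machine imputes its missing responses with a Gaussian-kernel Nadaraya-Watson fit on its own complete cases and sends only its local mean and sample size to the central machine, which forms a size-weighted average. The function names (`nw_predict`, `local_summary`, `distributed_mean`, `make_shards`), the bandwidth `h`, and the simulated data-generating model are all hypothetical choices made for illustration.

```python
import numpy as np


def nw_predict(x_obs, y_obs, x_new, h):
    """Gaussian-kernel Nadaraya-Watson estimate of E[Y | X = x] at each row of x_new."""
    sq_dist = ((x_new[:, None, :] - x_obs[None, :, :]) ** 2).sum(axis=2)
    weights = np.exp(-0.5 * sq_dist / h**2)
    denom = np.maximum(weights.sum(axis=1), 1e-300)  # guard against all-zero weights
    return weights @ y_obs / denom


def local_summary(x, y, observed, h):
    """Run on one machine: impute missing responses, return (local mean, local size)."""
    filled = np.where(observed, y, 0.0)  # placeholder 0.0 where y is missing
    if (~observed).any():
        filled[~observed] = nw_predict(x[observed], y[observed], x[~observed], h)
    return filled.mean(), len(y)


def distributed_mean(shards, h):
    """Run on the central machine: size-weighted average of the local means.

    Only two numbers per machine are communicated in this sketch.
    """
    summaries = [local_summary(x, y, obs, h) for x, y, obs in shards]
    total = sum(n for _, n in summaries)
    return sum(mu * n for mu, n in summaries) / total


def make_shards(n=4000, k=4, seed=0):
    """Hypothetical simulated data: E[Y] = 2, responses missing at random given X."""
    rng = np.random.default_rng(seed)
    x = rng.uniform(size=(n, 1))
    y = 1.0 + 2.0 * x[:, 0] + rng.normal(scale=0.5, size=n)
    p_obs = 1.0 / (1.0 + np.exp(-(2.0 * x[:, 0] - 0.2)))  # depends on X only
    observed = rng.uniform(size=n) < p_obs
    y = np.where(observed, y, np.nan)  # hide the unobserved responses
    idx = np.array_split(np.arange(n), k)
    return [(x[i], y[i], observed[i]) for i in idx]


if __name__ == "__main__":
    print(distributed_mean(make_shards(), h=0.05))
```

In this sketch the raw data never leave the local machines, which is one simple way to obtain a low communication cost. The bandwidth is fixed by hand here; the conditions on the bandwidth and on the number of machines under which such an estimator attains the efficiency bound are the subject of the paper's theory and are not reproduced in this example.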