Modern statistical analysis often encounters datasets of very large size. For such datasets, conventional estimation methods can rarely be applied directly, because practitioners typically work with limited computational resources and, in most cases, have no access to powerful distributed computing platforms (e.g., Hadoop or Spark). How to analyze large datasets in practice with limited computational resources therefore becomes a problem of great importance. To solve this problem, we propose a novel subsampling-based method with jackknife debiasing. The key idea is to treat the whole sample as if it were the population. Multiple subsamples of greatly reduced size are then drawn by simple random sampling with replacement. It is worth noting that we do not recommend sampling without replacement, because it would incur a significant cost for data processing on the hard drive; no such cost arises if the data are processed in memory. Because the subsampled datasets are relatively small, each can be comfortably read into computer memory as a whole and then processed easily. From each subsampled dataset, a jackknife-debiased estimator of the target parameter is obtained. The resulting estimators are statistically consistent, with an extremely small bias. Finally, the jackknife-debiased estimators from the different subsamples are averaged to form the final estimator. We show theoretically that the final estimator is consistent and asymptotically normal, and that its asymptotic statistical efficiency can be as good as that of the whole-sample estimator under very mild conditions. The proposed method is simple enough to be easily implemented on most practical computer systems and should therefore enjoy very wide applicability.
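To make the procedure concrete, the following is a minimal sketch in Python of the subsample-and-jackknife scheme described above. It is not the paper's implementation: the target parameter (the squared population mean, chosen only so the jackknife bias correction has a visible effect) and the names `estimator`, `jackknife_debias`, `n_subsamples`, and `subsample_size` are illustrative assumptions.

```python
# Minimal sketch: average jackknife-debiased estimates over subsamples drawn
# with replacement from a large dataset treated as the population.
# The squared-mean target and all names below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

def estimator(x):
    """Plug-in estimator of the squared mean (biased upward by Var(x)/len(x))."""
    return x.mean() ** 2

def jackknife_debias(x):
    """Standard delete-one jackknife bias correction of `estimator` on x."""
    n = len(x)
    theta_hat = estimator(x)
    # Leave-one-out estimates.
    loo = np.array([estimator(np.delete(x, i)) for i in range(n)])
    return n * theta_hat - (n - 1) * loo.mean()

def subsample_jackknife_estimate(full_data, n_subsamples=50, subsample_size=200):
    """Average jackknife-debiased estimates over subsamples obtained by
    simple random sampling *with* replacement from the full data."""
    estimates = []
    for _ in range(n_subsamples):
        sub = rng.choice(full_data, size=subsample_size, replace=True)
        estimates.append(jackknife_debias(sub))
    return np.mean(estimates)

# Treat a large sample as if it were the population.
population_like = rng.normal(loc=2.0, scale=1.0, size=1_000_000)
print(subsample_jackknife_estimate(population_like))  # close to 2.0**2 = 4.0
```

In this sketch, only one subsample of size `subsample_size` needs to be held in memory at a time, which mirrors the motivation stated above for drawing small subsamples rather than loading the whole dataset.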