Subsampling techniques can reduce the computational cost of processing big data. Practical subsampling plans typically involve an initial uniform sampling step followed by a refined sampling step. Given a subsample, big data inference is generally built on inverse probability weighting (IPW), which has two drawbacks: it becomes unstable when the sampling probabilities are close to zero, and it cannot incorporate auxiliary information. This paper addresses both drawbacks in three steps. First, we consider capture-recapture sampling, which combines an initial uniform sampling with a second Poisson sampling. Under this sampling plan, we propose an empirical likelihood weighting (ELW) estimation approach for an M-estimation parameter. Second, based on the ELW method, we construct a nearly optimal capture-recapture sampling plan that balances estimation efficiency and computational cost. Third, we derive methods for determining the smallest sample sizes with which the proposed sampling-and-estimation method produces estimators of guaranteed precision. Our ELW method overcomes the instability of IPW by circumventing the use of inverse probabilities, and it utilizes auxiliary information, including the size and certain sample moments of the big data. We show that the proposed ELW method produces more efficient estimators than IPW, which in turn leads to more efficient optimal sampling plans and more economical sample sizes for a prespecified estimation precision. These advantages are confirmed through simulation studies and real data analyses.
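To make the sampling plan and the IPW baseline concrete, the following is a minimal Python sketch of a two-stage plan: a uniform pilot sample followed by Poisson sampling, with an IPW (Horvitz-Thompson) estimate of a population mean. It does not implement the ELW method. The function names, the synthetic data, the sample sizes, and the choice of second-stage probabilities are all assumptions made for illustration only.

```python
import numpy as np


def poisson_sample(probs, rng):
    """Poisson sampling: include unit i independently with probability probs[i]."""
    return rng.random(probs.shape[0]) < probs


def ipw_mean(y, probs, selected, n_total):
    """IPW (Horvitz-Thompson) estimate of the population mean.

    Each selected value is weighted by 1 / probs[i]; a probability close to
    zero produces a very large weight, which is the source of instability.
    """
    return np.sum(y[selected] / probs[selected]) / n_total


rng = np.random.default_rng(0)

# Synthetic "big data" (assumption: a simple normal population).
N = 100_000
y = rng.normal(loc=2.0, scale=1.0, size=N)

# Stage 1: initial uniform sample, used here only to get a pilot estimate.
pilot_idx = rng.choice(N, size=1_000, replace=False)
pilot_mean = y[pilot_idx].mean()

# Stage 2: Poisson sampling. The probabilities below are a hypothetical
# choice (larger for points far from the pilot mean), scaled so that the
# expected second-stage sample size is about 5,000.
scores = np.abs(y - pilot_mean) + 0.1
probs = np.minimum(1.0, 5_000 * scores / scores.sum())
selected = poisson_sample(probs, rng)

estimate = ipw_mean(y, probs, selected, N)
print(f"full-data mean: {y.mean():.4f}")
print(f"IPW estimate:   {estimate:.4f} from {selected.sum()} sampled units")
```

In this sketch the constant 0.1 added to the scores keeps every sampling probability away from zero. Shrinking that constant makes some inverse weights very large, which illustrates the instability of IPW described above.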