As massive data analysis becomes increasingly prevalent, subsampling methods such as the Bag of Little Bootstraps (BLB) serve as powerful tools for assessing the quality of estimators computed on massive data. However, the performance of subsampling methods is highly sensitive to the choice of tuning parameters (e.g., the subset size and the number of resamples per subset). In this article we develop a hyperparameter selection methodology that can be used to choose tuning parameters for subsampling methods. Specifically, through a careful theoretical analysis, we derive an analytically simple and elegant relationship between the asymptotic efficiency of various subsampling estimators and their hyperparameters, which leads to an optimal choice of the hyperparameters. More specifically, given an arbitrarily specified hyperparameter set, we can improve it to a new set of hyperparameters with no extra CPU time cost, while substantially improving the statistical efficiency of the resulting estimator. Both simulation studies and real data analysis demonstrate the superior performance of our method.
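To make the tuning parameters concrete, the sketch below shows a minimal BLB procedure for estimating an estimator's standard error. The hyperparameters discussed in the abstract appear explicitly: `s` (the number of subsets), `r` (the number of resamples per subset), and `gamma` (controlling the subset size b = n^gamma). The function names and default values are illustrative assumptions, not the paper's implementation; the optimal-selection methodology described above is about choosing `s`, `r`, and `gamma` well, which this sketch does not attempt.

```python
import numpy as np

def blb_standard_error(data, estimator, s=10, r=50, gamma=0.6, rng=None):
    """Bag of Little Bootstraps estimate of an estimator's standard error.

    Hyperparameters (the tuning parameters referred to in the abstract):
      s     -- number of disjointly drawn little subsets
      r     -- number of resamples per subset
      gamma -- subset size exponent, b = n**gamma
    """
    rng = np.random.default_rng(rng)
    n = len(data)
    b = int(n ** gamma)
    subset_ses = []
    for _ in range(s):
        # Draw one little subset of size b without replacement.
        subset = rng.choice(data, size=b, replace=False)
        stats = []
        for _ in range(r):
            # Multinomial weights emulate a full size-n bootstrap resample
            # while only ever touching the b points in the subset.
            weights = rng.multinomial(n, np.full(b, 1.0 / b))
            stats.append(estimator(subset, weights))
        # Quality assessment within this subset: resample standard deviation.
        subset_ses.append(np.std(stats, ddof=1))
    # Average the per-subset assessments across the s subsets.
    return float(np.mean(subset_ses))

def weighted_mean(x, w):
    """Sample mean of x under multinomial resampling weights w."""
    return float(np.dot(w, x) / w.sum())
```

For example, on n = 10,000 i.i.d. standard normal observations, the true standard error of the sample mean is 1/sqrt(n) = 0.01, and `blb_standard_error(data, weighted_mean)` should return a value close to that. Larger `s` and `r` reduce the Monte Carlo noise of the assessment but cost proportionally more CPU time, which is exactly the efficiency-versus-cost trade-off the proposed methodology optimizes.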