The use and analysis of massive data are challenging due to high storage and computational costs. Subsampling algorithms are a popular way to reduce data volume and ease the computational burden. Existing subsampling approaches focus on data with numerical covariates. Although big data with categorical covariates are frequently encountered in many disciplines, subsampling plans for such data have not been well established. In this paper, we propose a balanced subsampling approach for reducing data with categorical covariates. The selected subsample achieves a combinatorial balance among the values of the covariates and therefore enjoys three desirable merits. First, a balanced subsample is nonsingular and thus allows the estimation of all parameters in ANOVA regression. Second, it provides optimal parameter estimation in the sense of minimizing the generalized variance of the estimated parameters. Third, a model trained on a balanced subsample provides robust predictions in the sense of minimizing the worst-case prediction error. We demonstrate the advantages of balanced subsampling over existing data reduction methods in extensive simulation studies and a real-world application.
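The first two merits can be illustrated with a toy calculation. The sketch below is not the paper's subsampling algorithm. It assumes two two-level categorical covariates, a main-effects ANOVA model with +/-1 (sum-to-zero) coding, and ordinary least squares with homoscedastic errors, in which case the generalized variance of the estimates is proportional to 1/det(X^T X). The helper names (`model_matrix`, `information_matrix`, `det3`) and the three example subsamples are hypothetical and chosen only for illustration.

```python
from itertools import product


def model_matrix(sample):
    """Main-effects ANOVA model matrix with +/-1 (sum-to-zero) coding.

    Each row of `sample` is a tuple of two-level factors coded 0 or 1.
    The first column is the intercept.
    """
    return [[1] + [1 if level == 1 else -1 for level in row] for row in sample]


def information_matrix(X):
    """Return X^T X as a list of lists."""
    p = len(X[0])
    return [[sum(r[i] * r[j] for r in X) for j in range(p)] for i in range(p)]


def det3(M):
    """Determinant of a 3x3 matrix by cofactor expansion."""
    (a, b, c), (d, e, f), (g, h, i) = M
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


# Balanced subsample (assumed example): every level combination appears twice.
balanced = list(product([0, 1], repeat=2)) * 2

# Unbalanced subsample of the same size: dominated by one combination.
unbalanced = [(0, 0)] * 5 + [(0, 1), (1, 0), (1, 1)]

# Degenerate subsample: level 1 of the first factor never appears,
# so its effect cannot be separated from the intercept.
degenerate = [(0, 0)] * 4 + [(0, 1)] * 4

det_balanced = det3(information_matrix(model_matrix(balanced)))
det_unbalanced = det3(information_matrix(model_matrix(unbalanced)))
det_degenerate = det3(information_matrix(model_matrix(degenerate)))

print(det_balanced, det_unbalanced, det_degenerate)  # 512 256 0
```

With eight rows and three parameters, the balanced subsample gives X^T X = 8I and attains the largest possible determinant under this coding (8^3 = 512), so its generalized variance is the smallest. The unbalanced subsample halves the determinant, and the degenerate one yields a singular information matrix, so not all parameters are estimable.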