Subsampling, or subdata selection, is a useful approach in large-scale statistical learning. Most existing studies focus on model-based subsampling methods, which depend heavily on the model assumption. In this paper, we consider a model-free subsampling strategy for generating subdata from the original full data. To measure how well a subdata set represents the original data, we propose a criterion, the generalized empirical F-discrepancy (GEFD), and study its theoretical properties in connection with the classical generalized L2-discrepancy from the theory of uniform designs. These properties allow us to develop a low-GEFD, data-driven subsampling method based on existing uniform designs. Through simulation examples and a real case study, we show that the proposed subsampling method is superior to random sampling. Moreover, our method remains robust under diverse model specifications, whereas other popular subsampling methods underperform. In practice, this model-free property is more appealing than model-based subsampling, which may perform poorly when the model is misspecified, as demonstrated in our simulation studies.
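The abstract does not spell out the algorithm, so the following is only a minimal sketch of the general idea of design-based, model-free subsampling, not the GEFD procedure itself. It rests on several assumptions made purely for illustration: a Halton sequence stands in for a uniform design, each coordinate is mapped to the unit interval by its marginal empirical distribution function (ranks), and each design point is matched greedily to its nearest unused data point. The function names `halton` and `uniform_design_subsample` are hypothetical.

```python
import numpy as np

def halton(n, d):
    """First n points of the d-dimensional Halton sequence in [0, 1)^d."""
    primes = [2, 3, 5, 7, 11, 13, 17, 19, 23, 29]
    if d > len(primes):
        raise ValueError("this sketch supports at most 10 dimensions")
    pts = np.empty((n, d))
    for j in range(d):
        base = primes[j]
        for i in range(n):
            # Radical inverse of i + 1 in the given base.
            f, r, k = 1.0, 0.0, i + 1
            while k > 0:
                f /= base
                r += f * (k % base)
                k //= base
            pts[i, j] = r
    return pts

def uniform_design_subsample(X, n_sub):
    """Return indices of n_sub distinct rows of X (illustrative only).

    Assumption: a Halton point set replaces a proper uniform design, and
    the match between design points and data is a greedy nearest-neighbour
    search on rank-transformed coordinates.
    """
    X = np.asarray(X, dtype=float)
    N, d = X.shape
    if n_sub > N:
        raise ValueError("n_sub cannot exceed the number of rows in X")
    # Marginal empirical CDF transform: rank each column, scale into (0, 1).
    ranks = X.argsort(axis=0).argsort(axis=0)
    U = (ranks + 0.5) / N
    design = halton(n_sub, d)
    available = np.ones(N, dtype=bool)
    chosen = []
    for p in design:
        dist = np.sum((U - p) ** 2, axis=1)
        dist[~available] = np.inf  # never pick the same row twice
        idx = int(np.argmin(dist))
        chosen.append(idx)
        available[idx] = False
    return np.array(chosen)

# Usage on synthetic data (no response model is involved at any point).
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 2))
idx = uniform_design_subsample(X, 50)
subdata = X[idx]
```

Because the selection uses only the covariates and never a fitted model, the same subdata can be reused under different model specifications, which is the property the abstract emphasizes.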