Functional L-Optimality Subsampling for Massive Data

Massive data pose major memory and computational challenges for researchers, which can be tackled to some extent by using subsamples of the full data as a surrogate. Functional data are commonly measured intensively over their domains, which requires even more memory and computation time when the sample size is large. The situation becomes much worse when statistical inference relies on bootstrap samples. To the best of our knowledge, no existing work systematically studies subsampling for functional linear regression or its generalizations. In this article, we propose an optimal subsampling method for the functional linear model based on the functional L-optimality criterion. When the response is a discrete or categorical variable, we further extend this subsampling method to the functional generalized linear model. We establish the asymptotic properties of the estimators obtained from the subsampling methods. The finite-sample performance of the proposed subsampling methods is investigated through extensive simulation studies. We also apply the proposed subsampling methods to analyze global climate data and kidney transplant data. The results of these analyses show that the optimal subsampling methods motivated by the functional L-optimality criterion perform much better than uniform subsampling and closely approximate the results based on the full data.
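
To make the idea of optimal subsampling concrete, the following is a minimal sketch for an ordinary (non-functional) linear model. It is not the method of this article: the abstract does not state the sampling probabilities, so the form used here (proportional to the absolute pilot residual times the covariate norm, a common L-optimality-style choice in the subsampling literature) is an assumption, as are the function name `l_opt_subsample_ols` and its parameters. In a functional setting one would presumably first expand each curve in a basis so that the rows of `X` hold basis coefficients, and add a roughness penalty; both steps are omitted here.

```python
import numpy as np

def l_opt_subsample_ols(X, y, r_pilot, r, rng):
    """Two-step subsampling sketch for ordinary linear regression (illustrative only)."""
    n = X.shape[0]
    # Step 1: a uniform pilot subsample gives a rough coefficient estimate.
    pilot_idx = rng.choice(n, size=r_pilot, replace=True)
    beta_pilot, *_ = np.linalg.lstsq(X[pilot_idx], y[pilot_idx], rcond=None)
    # Step 2: sampling probabilities proportional to |residual| * ||x_i||.
    # ASSUMPTION: this form is not taken from the abstract.
    scores = np.abs(y - X @ beta_pilot) * np.linalg.norm(X, axis=1)
    probs = scores / scores.sum()
    idx = rng.choice(n, size=r, replace=True, p=probs)
    # Step 3: inverse-probability-weighted least squares on the subsample,
    # implemented by scaling rows with the square root of the weights.
    sw = np.sqrt(1.0 / probs[idx])
    beta, *_ = np.linalg.lstsq(X[idx] * sw[:, None], y[idx] * sw, rcond=None)
    return beta, probs

# Usage on synthetic data (all values hypothetical).
rng = np.random.default_rng(0)
X = rng.standard_normal((50_000, 4))
beta_true = np.array([1.0, -2.0, 0.5, 3.0])
y = X @ beta_true + rng.standard_normal(50_000)
beta_sub, _ = l_opt_subsample_ols(X, y, r_pilot=500, r=2_000, rng=rng)
beta_full, *_ = np.linalg.lstsq(X, y, rcond=None)
print("subsample estimate:", beta_sub)
print("full-data estimate:", beta_full)
```

Weighting each sampled row by the inverse of its sampling probability is what keeps the subsample estimator targeting the full-data estimator even though informative points are drawn more often than under uniform subsampling.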
