Functional data analysis (FDA) methods are computationally and theoretically appealing for some high-dimensional data, but they do not scale to modern large-sample datasets. To address this challenge, we develop randomized algorithms for two important FDA methods: functional principal component analysis (FPCA) and functional linear regression (FLR) with scalar response. The two methods are connected because both rely on accurate estimation of the functional principal subspace. The proposed algorithms draw subsamples from the large dataset at hand and apply FPCA or FLR to the subsamples to reduce the computational cost. To preserve subspace information in the subsamples effectively, we propose a functional principal subspace sampling probability, which removes the eigenvalue scale effect inside the functional principal subspace and properly weights the residual. Using operator perturbation analysis, we show that the proposed probability precisely controls the first-order error of the subspace projection operator and can be interpreted as importance sampling for functional subspace estimation. Moreover, we establish concentration bounds for the proposed algorithms that reflect the low intrinsic dimension of functional data in an infinite-dimensional space. The effectiveness of the proposed algorithms is demonstrated on synthetic and real datasets.
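The subsampling idea described above can be illustrated with a minimal NumPy sketch. It is not the paper's algorithm: the abstract does not give the exact form of the sampling probability, so the weighting below (squared scores divided by their eigenvalues inside the leading subspace, plus the squared residual norm divided by a hypothetical `resid_weight`) is an assumption chosen only to mirror the verbal description. The function names, the helper `simulate_curves`, the equally spaced grid, and the use of the full data for the pilot eigen-decomposition are likewise illustrative assumptions.

```python
import numpy as np


def simulate_curves(n, p, rng, sds=(3.0, 2.0, 1.0), noise_sd=0.1):
    """Hypothetical test data: n curves on p grid points, low-rank signal plus noise."""
    k = len(sds)
    basis, _ = np.linalg.qr(rng.standard_normal((p, k)))  # orthonormal (p, k)
    scores = rng.standard_normal((n, k)) * np.asarray(sds)
    X = scores @ basis.T + noise_sd * rng.standard_normal((n, p))
    return X - X.mean(axis=0)  # center the curves


def subspace_sampling_probs(X, k, resid_weight=None):
    """Sampling probabilities for the rows of X (centered curves on a common grid).

    Assumed form, not taken from the paper:
        w_i = sum_{j<=k} score_ij^2 / lambda_j + ||residual_i||^2 / resid_weight
    """
    n = X.shape[0]
    # Pilot eigenpairs of the empirical covariance, computed here from the full
    # data purely for illustration.
    _, s, Vt = np.linalg.svd(X, full_matrices=False)
    lam = s**2 / n
    V = Vt[:k].T                                   # (p, k) leading eigenvectors
    scores = X @ V                                 # (n, k) principal scores
    inside = (scores**2 / lam[:k]).sum(axis=1)     # eigenvalue scale removed
    resid = X - scores @ V.T                       # part outside the subspace
    if resid_weight is None:
        resid_weight = lam[k:].sum()               # assumption: trailing variance
    if resid_weight > 0:
        outside = (resid**2).sum(axis=1) / resid_weight
    else:
        outside = np.zeros(n)
    w = inside + outside
    return w / w.sum()


def subsampled_fpca(X, k, m, probs, rng):
    """Draw m rows with replacement and estimate the leading k-dim subspace."""
    n = X.shape[0]
    idx = rng.choice(n, size=m, replace=True, p=probs)
    # Rescale so that Xs.T @ Xs is an unbiased estimate of X.T @ X.
    Xs = X[idx] / np.sqrt(m * probs[idx])[:, None]
    _, _, Vt = np.linalg.svd(Xs, full_matrices=False)
    return Vt[:k].T                                # (p, k) orthonormal basis


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = simulate_curves(n=2000, p=50, rng=rng)
    probs = subspace_sampling_probs(X, k=3)
    V_hat = subsampled_fpca(X, k=3, m=400, probs=probs, rng=rng)
    V_full = np.linalg.svd(X, full_matrices=False)[2][:3].T
    # Frobenius distance between the subsampled and full-data projections.
    print(np.linalg.norm(V_hat @ V_hat.T - V_full @ V_full.T))
```

Comparing projection matrices rather than individual eigenvectors avoids sign and rotation ambiguity, which matches the abstract's focus on the subspace projection operator. In practice the pilot eigenpairs would have to come from something cheaper than a full-data SVD, for example a small uniform subsample; that choice is left open here.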