Nystr\"om approximation is a fast randomized method that rapidly solves kernel ridge regression (KRR) problems through sub-sampling the n-by-n empirical kernel matrix appearing in the objective function. However, the performance of such a sub-sampling method heavily relies on correctly estimating the statistical leverage scores for forming the sampling distribution, which can be as costly as solving the original KRR. In this work, we propose a linear time (modulo poly-log terms) algorithm to accurately approximate the statistical leverage scores in the stationary-kernel-based KRR with theoretical guarantees. Particularly, by analyzing the first-order condition of the KRR objective, we derive an analytic formula, which depends on both the input distribution and the spectral density of stationary kernels, for capturing the non-uniformity of the statistical leverage scores. Numerical experiments demonstrate that with the same prediction accuracy our method is orders of magnitude more efficient than existing methods in selecting the representative sub-samples in the Nystr\"om approximation.
翻译:Nystr\"om 近似值是一种快速随机的方法,通过对目标函数中出现的 n- by- 经验性内核矩阵进行子抽样,迅速解决内核脊回归( KRRR) 的问题。然而,这种子抽样方法的性能在很大程度上依赖于正确估计用于形成抽样分布的统计杠杆分数,这与解决原始 KRR 一样昂贵。在这项工作中,我们提出了一个线性时间算法(modulo 多元术语),以理论保证来准确估计基于静止内核的 KRR 中的统计杠杆分数。特别是,通过分析 KRR 目标的第一阶状态,我们得出一种分析公式,它既取决于输入分布,也取决于静止内核的光谱密度,以捕捉到统计杠杆分数的非统一性。 数值实验表明,用同样的预测精确度,我们的方法比现有方法在选择 Nystr\\ omiming 中具有代表性的子样本时效率更高。