The divide-and-conquer method has been widely used for computing large-scale kernel ridge regression estimates. Unfortunately, when the response variable is highly skewed, divide-and-conquer kernel ridge regression (dacKRR) may overlook the underrepresented region and produce unacceptable results. We develop a novel response-adaptive partition strategy to overcome this limitation. In particular, we propose to allocate replicates of some carefully identified informative observations to multiple nodes (local processors). The idea is analogous to the popular oversampling technique. Although such a technique has been widely used for addressing discrete label skewness, extending it to the dacKRR setting is nontrivial. We provide both theoretical and practical guidance on how to effectively oversample the observations under the dacKRR setting. Furthermore, we show that the proposed estimate has a smaller asymptotic mean squared error (AMSE) than the classical dacKRR estimate under mild conditions. Our theoretical findings are supported by both simulated and real-data analyses.
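The partition idea can be illustrated with a minimal sketch. It is not the authors' procedure: the abstract does not state how informative observations are identified or how replicates are weighted, so the rule used here (responses above an upper quantile are copied to every node) is an assumption made only for illustration. The function names, the Gaussian kernel, and the parameters `tail_quantile`, `lam`, and `gamma` are likewise hypothetical.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    # Gaussian kernel matrix between the rows of A and the rows of B.
    d2 = (A**2).sum(1)[:, None] + (B**2).sum(1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(d2, 0.0))

def krr_fit(X, y, lam=1e-2, gamma=1.0):
    # Kernel ridge regression: solve (K + n * lam * I) alpha = y.
    n = X.shape[0]
    K = rbf_kernel(X, X, gamma)
    alpha = np.linalg.solve(K + n * lam * np.eye(n), y)
    return X, alpha

def krr_predict(model, X_new, gamma=1.0):
    X_train, alpha = model
    return rbf_kernel(X_new, X_train, gamma) @ alpha

def partition_indices(y, n_nodes, rng, tail_quantile=None):
    # tail_quantile=None gives the classical random partition.
    # Otherwise (ASSUMPTION, not the paper's rule) observations whose
    # response exceeds the given quantile are replicated on every node.
    n = y.shape[0]
    if tail_quantile is None:
        shared = np.array([], dtype=int)
        rest = np.arange(n)
    else:
        cutoff = np.quantile(y, tail_quantile)
        shared = np.flatnonzero(y > cutoff)
        rest = np.flatnonzero(y <= cutoff)
    chunks = np.array_split(rng.permutation(rest), n_nodes)
    return [np.concatenate([chunk, shared]) for chunk in chunks]

def dac_krr_predict(X, y, X_new, n_nodes, rng,
                    tail_quantile=None, lam=1e-2, gamma=1.0):
    # Fit one local KRR per node, then average the local predictions.
    preds = []
    for idx in partition_indices(y, n_nodes, rng, tail_quantile):
        model = krr_fit(X[idx], y[idx], lam, gamma)
        preds.append(krr_predict(model, X_new, gamma))
    return np.mean(preds, axis=0)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(-1, 1, size=(400, 1))
    # Skewed response: large values occur only near x = 1.
    y = np.exp(3 * X[:, 0]) + 0.1 * rng.standard_normal(400)
    X_new = np.linspace(-1, 1, 5)[:, None]
    print(dac_krr_predict(X, y, X_new, n_nodes=8, rng=rng))
    print(dac_krr_predict(X, y, X_new, n_nodes=8, rng=rng, tail_quantile=0.9))
```

Calling `dac_krr_predict` with `tail_quantile=None` corresponds to the classical random partition, while passing a quantile places every high-response observation on all nodes. The sketch makes no claim about which variant is more accurate; the choice of which observations to replicate, and how many, is exactly what the paper's theoretical and practical guidance addresses.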