Kernel ridge regression (KRR) is a popular scheme for non-linear, non-parametric learning. However, existing implementations of KRR require that all the data be stored in main memory, which severely limits the use of KRR in contexts where the data size far exceeds the available memory. Such applications are increasingly common in data mining, bioinformatics, and control. A powerful paradigm for computing on data sets that are too large for memory is the streaming model of computation, where we process one data sample at a time, discarding each sample before moving on to the next one. In this paper, we propose StreaMRAK, a streaming version of KRR. StreaMRAK improves on existing KRR schemes by dividing the problem into several levels of resolution, which allows continual refinement of the predictions. The algorithm reduces the memory requirement by continuously and efficiently integrating new samples into the training model. With a novel sub-sampling scheme, StreaMRAK reduces memory and computational complexities by creating a sketch of the original data, where the sub-sampling density is adapted to the bandwidth of the kernel and the local dimensionality of the data. We present a showcase study on two synthetic problems and the prediction of the trajectory of a double pendulum. The results show that the proposed algorithm is fast and accurate.
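To make the streaming idea concrete, the following is a minimal single-resolution sketch and not the StreaMRAK algorithm itself. It assumes a Gaussian kernel, a greedy distance-based cover as a stand-in for the sub-sampling scheme (a sample becomes a landmark only if it lies farther than a radius tied to the kernel bandwidth from every existing landmark), and a Nyström-style KRR solve whose normal equations are accumulated one sample at a time. All function names, the toy data stream, and the parameter values are hypothetical choices made for illustration; the multi-resolution levels and the adaptation to local dimensionality described above are not modeled.

```python
import numpy as np


def gaussian_kernel(X, Z, bandwidth):
    """Gaussian kernel matrix between the rows of X and the rows of Z."""
    sq = ((X[:, None, :] - Z[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-sq / (2.0 * bandwidth ** 2))


def select_landmarks(stream, radius):
    """Greedy cover: keep a sample only if it is farther than `radius`
    from every landmark kept so far (assumed stand-in for sub-sampling)."""
    landmarks = []
    for x, _ in stream:
        if all(np.linalg.norm(x - c) > radius for c in landmarks):
            landmarks.append(x)
    return np.array(landmarks)


def fit_streaming(stream, landmarks, bandwidth, lam):
    """Accumulate Nystrom-style normal equations one sample at a time.
    Only m x m and length-m statistics are kept; samples are discarded."""
    m = len(landmarks)
    A = np.zeros((m, m))  # running sum of k(x) k(x)^T
    b = np.zeros(m)       # running sum of k(x) * y
    n = 0
    for x, y in stream:
        k = gaussian_kernel(x[None, :], landmarks, bandwidth)[0]
        A += np.outer(k, k)
        b += k * y
        n += 1
    K_mm = gaussian_kernel(landmarks, landmarks, bandwidth)
    # Least-squares solve tolerates the ill-conditioning typical of kernels.
    alpha = np.linalg.lstsq(A + lam * n * K_mm, b, rcond=None)[0]
    return alpha


def predict(X, landmarks, alpha, bandwidth):
    """Evaluate the fitted model at the rows of X."""
    return gaussian_kernel(X, landmarks, bandwidth) @ alpha


def make_stream(n, seed):
    """Hypothetical toy stream: noiseless samples of sin(x) on [-3, 3]."""
    rng = np.random.default_rng(seed)
    for _ in range(n):
        x = rng.uniform(-3.0, 3.0, size=1)
        yield x, np.sin(x[0])


bandwidth = 0.5
# First pass: build the sketch, with the cover radius tied to the bandwidth.
landmarks = select_landmarks(make_stream(2000, seed=0), radius=0.5 * bandwidth)
# Second pass over the same stream: accumulate statistics and solve.
alpha = fit_streaming(make_stream(2000, seed=0), landmarks, bandwidth, lam=1e-8)

X_test = np.linspace(-2.5, 2.5, 50)[:, None]
max_err = np.max(np.abs(predict(X_test, landmarks, alpha, bandwidth)
                        - np.sin(X_test[:, 0])))
print("landmarks kept:", len(landmarks), "max abs test error:", max_err)
```

In this sketch the memory footprint depends on the number of landmarks rather than on the number of samples seen, which is the property the streaming model is meant to deliver. The two-pass structure is a simplification: selecting landmarks and accumulating statistics in a single pass would require deciding how to handle samples that arrive before a landmark is added.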