We devise coresets for kernel $k$-Means with a general kernel, and use them to obtain new, more efficient, algorithms. Kernel $k$-Means has superior clustering capability compared to classical $k$-Means, particularly when clusters are non-linearly separable, but it also introduces significant computational challenges. We address this computational issue by constructing a coreset, which is a reduced dataset that accurately preserves the clustering costs. Our main result is a coreset for kernel $k$-Means that works for a general kernel and has size $\mathrm{poly}(k\epsilon^{-1})$. Our new coreset both generalizes and greatly improves all previous results; moreover, it can be constructed in time near-linear in $n$. This result immediately implies new algorithms for kernel $k$-Means, such as a $(1+\epsilon)$-approximation in time near-linear in $n$, and a streaming algorithm using space and update time $\mathrm{poly}(k \epsilon^{-1} \log n)$. We validate our coreset on various datasets with different kernels. Our coreset performs consistently well, achieving small errors while using very few points. We show that our coresets can speed up kernel $k$-Means++ (the kernelized version of the widely used $k$-Means++ algorithm), and we further use this faster kernel $k$-Means++ for spectral clustering. In both applications, we achieve up to 1000x speedup while the error is comparable to baselines that do not use coresets.
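To make the notion of "clustering cost" concrete, the following is a minimal sketch of how the kernel $k$-Means cost of a fixed partition can be evaluated from the kernel matrix alone, with optional point weights such as those a coreset would carry. It is not the coreset construction described above. The function name `kernel_kmeans_cost` and its parameters are illustrative assumptions, and each cluster's center is taken to be its weighted mean in feature space.

```python
import numpy as np

def kernel_kmeans_cost(K, labels, weights=None):
    """Kernel k-Means cost of a fixed partition, computed from the kernel matrix K.

    Hypothetical helper for illustration. Each cluster's center is its
    (weighted) mean in feature space, so the cost of a cluster C is
        sum_{i in C} w_i K_ii  -  (1 / W_C) * sum_{i,j in C} w_i w_j K_ij,
    where W_C is the total weight of C.
    """
    K = np.asarray(K, dtype=float)
    labels = np.asarray(labels)
    n = K.shape[0]
    w = np.ones(n) if weights is None else np.asarray(weights, dtype=float)

    cost = 0.0
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)   # members of cluster c
        wc = w[idx]
        Kc = K[np.ix_(idx, idx)]            # kernel sub-matrix of the cluster
        cost += wc @ np.diag(Kc) - (wc @ Kc @ wc) / wc.sum()
    return float(cost)
```

With unit weights this is the ordinary kernel $k$-Means objective on the full dataset. Passing a small weighted subset instead gives the quantity that a coreset is meant to keep within a $(1\pm\epsilon)$ factor of the full cost.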