In this paper, we derive a new dimension-free non-asymptotic upper bound for the quadratic $k$-means excess risk related to the quantization of an i.i.d. sample in a separable Hilbert space. We improve the bound of order $\mathcal{O} \bigl( k / \sqrt{n} \bigr)$ of Biau, Devroye and Lugosi by establishing a bound of order $\mathcal{O} \bigl(\log(n/k) \sqrt{k \log(k) / n} \, \bigr)$, where $k$ is the number of centers and $n$ the sample size. This is essentially optimal up to logarithmic factors, since a lower bound of order $\Omega \bigl( \sqrt{k^{1 - 4/d}/n} \bigr)$ is known in dimension $d$. Our proof technique is based on the linearization of the $k$-means criterion through a kernel trick and on PAC-Bayesian inequalities. To get a $1 / \sqrt{n}$ rate, we introduce a new PAC-Bayesian chaining method that replaces the concept of $\delta$-net with the perturbation of the parameter by an infinite-dimensional Gaussian process. Along the way, we embed the usual $k$-means criterion into a broader family built upon the Kullback divergence and its underlying properties. This results in a new algorithm that we call information $k$-means, well suited to the clustering of bags of words. Based on considerations from information theory, we also introduce a new bounded $k$-means criterion that uses a scale parameter but satisfies a generalization bound that does not require any boundedness or even integrability conditions on the sample. We describe the counterpart of Lloyd's algorithm and prove generalization bounds for these new $k$-means criteria.
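To give a feeling for how the three rates quoted above scale, the following minimal Python sketch evaluates them with every multiplicative constant set to one. This is an assumption made purely for illustration: the constants of the actual bounds are not stated here, so only the growth in $k$ and $n$ is meaningful, not the absolute values. The function names `old_rate`, `new_rate` and `lower_rate` and the dimension `d = 10` are hypothetical choices, not taken from the paper.

```python
import math


def old_rate(k: int, n: int) -> float:
    """Order of the earlier bound, k / sqrt(n), with the constant set to 1."""
    return k / math.sqrt(n)


def new_rate(k: int, n: int) -> float:
    """Order of the new bound, log(n/k) * sqrt(k log(k) / n), constant set to 1."""
    return math.log(n / k) * math.sqrt(k * math.log(k) / n)


def lower_rate(k: int, n: int, d: int) -> float:
    """Order of the known lower bound in dimension d, sqrt(k^(1 - 4/d) / n)."""
    return math.sqrt(k ** (1 - 4 / d) / n)


# Illustrative (k, n) pairs; d = 10 is an arbitrary choice for the lower bound.
for k, n in [(10, 10_000), (1_000, 1_000_000), (10_000, 100_000_000)]:
    print(
        f"k={k:>6} n={n:>11} "
        f"old={old_rate(k, n):.4f} "
        f"new={new_rate(k, n):.4f} "
        f"lower(d=10)={lower_rate(k, n, 10):.5f}"
    )
```

With unit constants the logarithmic factors dominate for small $k$, so the new expression is the smaller of the two only once $k$ is large compared with $\log^2(n/k) \log(k)$. The gain is in the dependence on $k$, which drops from linear to roughly $\sqrt{k}$.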