Re-embedding data to strengthen recovery guarantees of clustering

We propose a clustering method that chains four known techniques into a pipeline, yielding an algorithm with stronger recovery guarantees than any of the four components alone. Given $n$ points in $\mathbb R^d$, the first component of our pipeline, which we call leapfrog distances, is reminiscent of density-based clustering and yields an $n\times n$ distance matrix. The leapfrog distances are then converted, using multidimensional scaling and spectral methods (two other known techniques), into new embeddings of the $n$ points in $\mathbb R^{d'}$, where $d'\ll d$ in general. Finally, sum-of-norms (SON) clustering is applied to the re-embedded points. Although the fourth step (SON clustering) can in principle be replaced by any other clustering method, our focus is on provable guarantees of recovery of underlying structure. We therefore establish that the re-embedding improves recovery for SON clustering, a well-studied method that already has provable guarantees.
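The abstract does not spell out the SON objective. As a minimal sketch, assuming the commonly used unweighted form $\frac12\sum_i\|x_i-u_i\|^2+\lambda\sum_{i<j}\|u_i-u_j\|$, the function below evaluates that objective for a given set of candidate centroids; it does not solve the optimization problem. The names `son_objective`, `points`, `centroids`, and `lam` are illustrative and are not taken from the paper. In the pipeline described above, `points` would be the re-embedded points in $\mathbb R^{d'}$.

```python
import math


def son_objective(points, centroids, lam):
    """Evaluate the unweighted sum-of-norms clustering objective.

    points:    list of n observations (each a sequence of floats)
    centroids: list of n candidate representatives, one per observation
    lam:       nonnegative fusion penalty weight (illustrative name)
    """
    # Fidelity term: half the squared distance from each point to its centroid.
    fidelity = 0.5 * sum(
        sum((x - u) ** 2 for x, u in zip(p, c))
        for p, c in zip(points, centroids)
    )
    # Fusion term: Euclidean distance between every pair of centroids.
    n = len(centroids)
    fusion = sum(
        math.dist(centroids[i], centroids[j])
        for i in range(n)
        for j in range(i + 1, n)
    )
    return fidelity + lam * fusion
```

At a minimizer, points whose centroids coincide are assigned to the same cluster. With `lam = 0` every point keeps its own centroid, and as `lam` grows the fusion term pulls centroids together until they merge.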
