Despite the ubiquity of kernel-based clustering, surprisingly few statistical guarantees exist beyond settings that impose strong structural assumptions on the data generation process. In this work, we take a step towards bridging this gap by studying the statistical performance of kernel-based clustering algorithms under non-parametric mixture models. We provide necessary and sufficient separability conditions under which these algorithms can consistently recover the underlying true clustering. Our analysis provides guarantees for kernel clustering approaches without structural assumptions on the form of the component distributions. Additionally, we establish a key equivalence between kernel-based data-clustering and kernel density-based clustering. This enables us to provide consistency guarantees for kernel-based estimators of non-parametric mixture models. Beyond its theoretical consequences, this connection could also matter in practice, for instance in the systematic choice of the bandwidth of the Gaussian kernel in the context of clustering.
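To make the setting concrete, the following is a minimal, self-contained sketch of one common kernel-based clustering pipeline of the kind the abstract refers to: build a Gaussian kernel matrix, embed the data via its top eigenvectors, and run k-means in that embedding. This is an illustrative example, not the algorithm analyzed in the paper; the bandwidth here is set by the widely used median heuristic, which is one possible stand-in for the systematic bandwidth choice the paper alludes to, and all function names are our own.

```python
import numpy as np

def gaussian_kernel(X, bandwidth):
    """Pairwise Gaussian kernel matrix K_ij = exp(-||x_i - x_j||^2 / (2 h^2))."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    np.maximum(d2, 0, out=d2)  # guard against tiny negative round-off
    return np.exp(-d2 / (2 * bandwidth**2))

def median_heuristic(X):
    """One common bandwidth rule of thumb: the median pairwise distance."""
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
    np.maximum(d2, 0, out=d2)
    return np.sqrt(np.median(d2[np.triu_indices_from(d2, k=1)]))

def spectral_cluster(X, k, bandwidth=None, n_iter=100):
    """Cluster X into k groups using the top-k eigenvectors of the kernel matrix."""
    if bandwidth is None:
        bandwidth = median_heuristic(X)
    K = gaussian_kernel(X, bandwidth)
    _, vecs = np.linalg.eigh(K)       # eigenvalues ascending
    V = vecs[:, -k:]                  # top-k eigenvectors as the embedding
    # Deterministic farthest-point initialization for Lloyd's k-means.
    centers = [V[0]]
    for _ in range(1, k):
        d = np.min(((V[:, None] - np.array(centers)[None]) ** 2).sum(-1), axis=1)
        centers.append(V[np.argmax(d)])
    centers = np.array(centers)
    for _ in range(n_iter):           # plain Lloyd iterations in the embedding
        labels = np.argmin(((V[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = V[labels == j].mean(axis=0)
    return labels
```

On well-separated mixture components this pipeline recovers the component memberships; the theoretical question studied in the paper is precisely how much separation between the (arbitrary, non-parametric) components is necessary and sufficient for such consistency.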