Clustering algorithms partition a dataset into groups of similar points. The clustering problem is very general, and different partitions of the same dataset may each be considered correct and useful. To fully understand such data, one must consider it at a variety of scales, ranging from coarse to fine. We introduce the Multiscale Environment for Learning by Diffusion (MELD) data model, a family of clusterings parameterized by nonlinear diffusion on the dataset. We show that the MELD data model precisely captures latent multiscale structure in data and facilitates its analysis. To efficiently learn the multiscale structure observed in many real datasets, we introduce the Multiscale Learning by Unsupervised Nonlinear Diffusion (M-LUND) clustering algorithm, which is derived from a diffusion process at a range of temporal scales. We provide theoretical guarantees for the algorithm's performance and establish its computational efficiency. Finally, we show that the M-LUND clustering algorithm detects the latent structure in a range of synthetic and real datasets.
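To make the role of diffusion time as a scale parameter concrete, the following is a minimal sketch and not the M-LUND algorithm itself. It builds a random walk from a Gaussian kernel on a toy dataset and counts how many eigenvalues of the t-step operator remain above a threshold, a simple stand-in for the number of clusters visible at time t. The helper names, the toy data, the kernel bandwidth `sigma`, and the threshold of 0.5 are all illustrative assumptions.

```python
import numpy as np


def diffusion_eigenvalues(X, sigma):
    """Eigenvalues of the random walk P = D^{-1} W built from a Gaussian kernel W."""
    # Pairwise squared Euclidean distances between all points.
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-d2 / sigma**2)
    deg = W.sum(axis=1)
    # P is similar to the symmetric matrix S = D^{-1/2} W D^{-1/2},
    # so both share the same real eigenvalues.
    S = W / np.sqrt(np.outer(deg, deg))
    lam = np.linalg.eigvalsh(S)
    # Clip tiny round-off excursions outside [0, 1] before taking powers.
    return np.clip(lam, 0.0, 1.0)


def clusters_at_scale(lam, t, threshold=0.5):
    """Count eigenvalues of P^t above `threshold` (an assumed, illustrative rule)."""
    return int(np.sum(lam ** t > threshold))


def toy_data():
    """Four tight 3x3 blobs arranged as two well-separated pairs (hypothetical data)."""
    offsets = np.array([(dx, dy)
                        for dx in (-0.05, 0.0, 0.05)
                        for dy in (-0.05, 0.0, 0.05)])
    centers = np.array([(0.0, 0.0), (1.0, 0.0), (2.75, 0.0), (3.75, 0.0)])
    return np.concatenate([c + offsets for c in centers])


X = toy_data()
lam = diffusion_eigenvalues(X, sigma=0.5)
times = [1, 1000, 10**7]
counts = [clusters_at_scale(lam, t) for t in times]
for t, k in zip(times, counts):
    print(f"t = {t:>8d}: {k} cluster(s) visible")
```

On this toy set the count should fall from four blobs, to two pairs, to a single group as t grows, which is the coarse-to-fine behavior that a family of clusterings indexed by diffusion time is meant to capture. Assigning labels at each scale would require an additional step that this sketch leaves out.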