Clustering methods seek to partition data so that elements are more similar to elements in the same cluster than to elements in different clusters. The main challenge in this task is the lack of a unified definition of a cluster, especially for high-dimensional data. Different methods and approaches have been proposed to address this problem. This paper continues the study initiated by Efimov, Adamyan and Spokoiny (2019), which proposed a novel approach to adaptive nonparametric clustering called Adaptive Weights Clustering (AWC). The method allows analyzing high-dimensional data with an unknown number of unbalanced clusters of arbitrary shape under very weak modeling assumptions. The procedure demonstrates state-of-the-art performance and is very efficient even for large data dimension D. However, the theoretical study in Efimov, Adamyan and Spokoiny (2019) is very limited and does not really address the question of efficiency. This paper makes a significant step toward understanding the remarkable performance of the AWC procedure, particularly in high dimension. The approach is based on combining the ideas of adaptive clustering and manifold learning. The manifold hypothesis means that high-dimensional data can be well approximated by a d-dimensional manifold for small d. This helps to overcome the curse of dimensionality and to obtain sharp bounds on the cluster separation which depend only on the intrinsic dimension d. We also address the problem of parameter tuning. Our general theoretical results are illustrated by some numerical experiments.