Spectral clustering became a popular choice for data clustering for its ability of uncovering clusters of different shapes. However, it is not always preferable over other clustering methods due to its computational demands. One of the effective ways to bypass these computational demands is to perform spectral clustering on a subset of points (data representatives) then generalize the clustering outcome, this is known as approximate spectral clustering (ASC). ASC uses sampling or quantization to select data representatives. This makes it vulnerable to 1) performance inconsistency (since these methods have a random step either in initialization or training), 2) local statistics loss (because the pairwise similarities are extracted from data representatives instead of data points). We proposed a refined version of $k$-nearest neighbor graph, in which we keep data points and aggressively reduce number of edges for computational efficiency. Local statistics were exploited to keep the edges that do not violate the intra-cluster distances and nullify all other edges in the $k$-nearest neighbor graph. We also introduced an optional step to automatically select the number of clusters $C$. The proposed method was tested on synthetic and real datasets. Compared to ASC methods, the proposed method delivered a consistent performance despite significant reduction of edges.
翻译:光谱集群因其能够发现不同形状的群集而成为数据集群的流行选择。然而,由于计算需求,它并不总是比其他群集方法更可取。绕过这些计算需求的有效方法之一是在一个子点(数据代表)上进行光谱集群,然后将群集结果普遍化,这被称为近似光谱集群。ASC使用抽样或量度来选择数据代表。这使得它容易出现以下情况:(1)性能不一致(因为这些方法要么在初始化或培训中有一个随机步骤 ) ;(2) 当地统计损失(因为从数据代表而不是数据点中提取对等相似之处)。我们建议了一个精细化的美元最近邻图形版本,在其中我们保留数据点并大力减少计算效率的边缘数。本地统计数据被用来保持不侵犯群集内部距离的边缘,并消除最近邻图中的所有其他边缘。我们还引入了一个可选步骤,以自动选择美元组数。拟议的方法在合成和真实的边框上进行了测试,尽管采用了一致的性能降低方法。比较了ASC。