Spectral clustering is a leading and popular technique in unsupervised data analysis. Two of its major limitations are scalability and generalization of the spectral embedding (i.e., out-of-sample extension). In this paper we introduce a deep learning approach to spectral clustering that overcomes the above shortcomings. Our network, which we call SpectralNet, learns a map that embeds input data points into the eigenspace of their associated graph Laplacian matrix and subsequently clusters them. We train SpectralNet using a procedure that involves constrained stochastic optimization. Stochastic optimization allows it to scale to large datasets, while the constraints, which are implemented using a special-purpose output layer, allow us to keep the network output orthogonal. Moreover, the map learned by SpectralNet naturally generalizes the spectral embedding to unseen data points. To further improve the quality of the clustering, we replace the standard pairwise Gaussian affinities with affinities learned from unlabeled data using a Siamese network. Additional improvement can be achieved by applying the network to code representations produced, e.g., by standard autoencoders. Our end-to-end learning procedure is fully unsupervised. In addition, we apply VC dimension theory to derive a lower bound on the size of SpectralNet. State-of-the-art clustering results are reported on the Reuters dataset. Our implementation is publicly available at https://github.com/kstant0725/SpectralNet .
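The classical pipeline that SpectralNet approximates, embedding points into the eigenspace of their graph Laplacian built from pairwise Gaussian affinities, can be sketched as follows. This is a minimal illustrative sketch of standard spectral embedding, not the paper's network-based method; the function name and parameters are our own choices.

```python
import numpy as np

def spectral_embedding(X, sigma=1.0, k=2):
    """Embed points into the eigenspace of their graph Laplacian.

    Illustrative sketch using dense Gaussian affinities; SpectralNet
    instead learns this map with a neural network so it scales and
    extends to unseen points.
    """
    # Pairwise Gaussian affinities W_ij = exp(-||x_i - x_j||^2 / (2 sigma^2))
    sq_dists = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    W = np.exp(-sq_dists / (2.0 * sigma ** 2))
    np.fill_diagonal(W, 0.0)
    # Unnormalized graph Laplacian L = D - W
    D = np.diag(W.sum(axis=1))
    L = D - W
    # Eigenvectors of the k smallest eigenvalues give the embedding
    eigvals, eigvecs = np.linalg.eigh(L)
    return eigvecs[:, :k]

# Two well-separated blobs: the second eigenvector (Fiedler vector)
# separates them, so a simple threshold recovers the two clusters.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.1, (20, 2)),
               rng.normal(5.0, 0.1, (20, 2))])
Y = spectral_embedding(X, sigma=1.0, k=2)
labels = (Y[:, 1] > np.median(Y[:, 1])).astype(int)
```

Running k-means on the rows of the embedding, as in standard spectral clustering, generalizes this threshold step to more than two clusters.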