Subspace clustering (SC) aims to cluster data lying in a union of low-dimensional subspaces. Usually, SC learns an affinity matrix and then performs spectral clustering. Both steps suffer from high time and space complexity, which makes clustering large datasets difficult. This paper presents a method called k-Factorization Subspace Clustering (k-FSC) for large-scale subspace clustering. k-FSC directly factorizes the data into k groups by pursuing structured sparsity in a matrix factorization model. Thus, k-FSC avoids learning an affinity matrix and performing eigenvalue decomposition, and has low (linear) time and space complexity on large datasets. The paper proves the effectiveness of the k-FSC model theoretically and proposes an efficient algorithm with a convergence guarantee to solve the k-FSC optimization problem. In addition, k-FSC is able to handle sparse noise, outliers, and missing data, which are pervasive in real applications. The paper also provides an online extension and an out-of-sample extension of k-FSC to handle streaming data and to cluster arbitrarily large datasets. Extensive experiments on large-scale real datasets show that k-FSC and its extensions outperform state-of-the-art subspace clustering methods.
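To make the idea of grouping data by factorization, rather than by an affinity matrix, more concrete, the following is a minimal sketch and not the k-FSC algorithm itself. It assumes the k subspace bases are already known, whereas k-FSC learns the factors and the group structure jointly. Each column is assigned to the basis that reconstructs it with the smallest least-squares residual. The names `assign_to_subspaces` and `bases`, and the synthetic data, are hypothetical and chosen only for illustration.

```python
import numpy as np

def assign_to_subspaces(X, bases):
    """Label each column of X with the index of the basis that reconstructs it best."""
    residuals = []
    for D in bases:
        # Least-squares coefficients of every column with respect to this basis.
        C, *_ = np.linalg.lstsq(D, X, rcond=None)
        # Per-column reconstruction error for this basis.
        residuals.append(np.linalg.norm(X - D @ C, axis=0))
    return np.argmin(np.stack(residuals), axis=0)

# Synthetic data (assumption): two random 3-dimensional subspaces in R^20.
rng = np.random.default_rng(0)
m, d, n_per = 20, 3, 50
bases = [np.linalg.qr(rng.standard_normal((m, d)))[0] for _ in range(2)]
X = np.hstack([D @ rng.standard_normal((d, n_per)) for D in bases])
truth = np.repeat([0, 1], n_per)

labels = assign_to_subspaces(X, bases)
print("fraction correctly assigned:", (labels == truth).mean())
```

The cost of this assignment step grows linearly with the number of columns, which is the kind of scaling the abstract contrasts with building an n-by-n affinity matrix and running an eigenvalue decomposition.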