Clustering data lying close to a union of low-dimensional manifolds, with each manifold as a cluster, is a fundamental problem in machine learning. When the manifolds are assumed to be linear subspaces, many methods succeed by using low-rank and sparse priors, which have been studied extensively over the past two decades. Unfortunately, most real-world datasets cannot be well approximated by linear subspaces. To address this, several works have proposed to identify the manifolds by learning a feature map such that the transformed data lie in a union of linear subspaces, even though the original data come from non-linear manifolds. However, most of these works either assume that the cluster membership of the samples is known, or have been shown to learn trivial representations. In this paper, we propose to simultaneously perform clustering and learn a union-of-subspace representation via Maximal Coding Rate Reduction. Experiments on synthetic and realistic datasets show that the proposed method achieves clustering accuracy comparable with state-of-the-art alternatives, while being more scalable and learning geometrically meaningful representations.
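To make the objective named above concrete, the following is a minimal sketch of the coding rate reduction quantity as it is commonly formulated in the Maximal Coding Rate Reduction literature: the coding rate of all features minus the membership-weighted coding rates of the individual clusters. It is not the method proposed in the paper, which additionally learns the feature map and the memberships jointly. The function names `coding_rate` and `rate_reduction`, the default distortion `eps`, and the use of a soft membership matrix `Pi` whose rows sum to one are assumptions made for illustration.

```python
import numpy as np


def coding_rate(Z, eps=0.5):
    """Coding rate R(Z) = 1/2 * logdet(I + d / (n * eps^2) * Z Z^T).

    Z has shape (d, n): n feature vectors of dimension d, stored as columns.
    """
    d, n = Z.shape
    gram = np.eye(d) + (d / (n * eps**2)) * (Z @ Z.T)
    _, logdet = np.linalg.slogdet(gram)
    return 0.5 * logdet


def rate_reduction(Z, Pi, eps=0.5):
    """Rate of the whole set minus the weighted rates of the clusters.

    Pi has shape (n, k); Pi[i, j] is the (soft) membership of sample i
    in cluster j, with each row summing to 1 (an illustrative assumption).
    """
    d, n = Z.shape
    per_cluster = 0.0
    for j in range(Pi.shape[1]):
        w = Pi[:, j]
        n_j = w.sum()
        if n_j < 1e-12:
            continue  # an empty cluster contributes nothing
        cov = (Z * w) @ Z.T  # equals Z diag(w) Z^T
        gram = np.eye(d) + (d / (n_j * eps**2)) * cov
        _, logdet = np.linalg.slogdet(gram)
        per_cluster += (n_j / (2 * n)) * logdet
    return coding_rate(Z, eps) - per_cluster
```

Under this formulation, features that spread out overall while each cluster stays within a low-dimensional subspace give a large value, whereas a membership that mixes the subspaces gives a small one. In practice the features would come from a learned map and are typically normalized.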