Subspace clustering is the classical problem of clustering a collection of data samples that lie approximately on a union of low-dimensional subspaces. The current state-of-the-art approaches to this problem are based on the self-expressive model, which represents each sample as a linear combination of the other samples. However, these approaches require sufficiently well-spread samples within each subspace for an accurate representation, a condition that many applications cannot guarantee. In this paper, we shed light on this commonly neglected issue and argue that the data distribution within each subspace plays a critical role in the success of self-expressive models. Our proposed solution is motivated by the central role data augmentation plays in the generalization power of deep neural networks. We propose two subspace clustering frameworks, for the unsupervised and the semi-supervised settings, that use augmented samples as an enlarged dictionary to improve the quality of the self-expressive representation. For the semi-supervised problem, we present an automatic augmentation strategy that uses a few labeled samples and relies on the fact that the data lie in a union of linear subspaces. Experimental results confirm the effectiveness of data augmentation: it significantly improves the performance of general self-expressive models.
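To make the enlarged-dictionary idea concrete, the following is a minimal sketch of a self-expressive solve against an augmented dictionary. Everything here is an illustrative assumption rather than the paper's exact method: two toy 1-D subspaces, scaling as the augmentation (which keeps a sample inside its linear subspace), and a ridge-regularized self-expression in place of the sparse or low-rank formulations typically used.

```python
import numpy as np

# Two orthogonal 1-D subspaces (lines) in R^3, with only two samples each,
# mimicking the sparsely-sampled regime discussed above (toy data).
b1 = np.array([1.0, 1.0, 0.0]) / np.sqrt(2.0)
b2 = np.array([0.0, 0.0, 1.0])
X = np.column_stack([1.0 * b1, -2.0 * b1, 0.5 * b2, 3.0 * b2])  # (3, 4)
n = X.shape[1]

# "Augmentation": scaled copies of every sample. Scaling keeps each sample
# inside its (linear) subspace, so the enlarged dictionary stays
# subspace-consistent. This particular augmentation is an assumption made
# for illustration, not necessarily the paper's strategy.
D = np.hstack([X, 1.7 * X, -0.6 * X])  # enlarged dictionary, (3, 12)

def self_express(i, X, D, lam=1e-3):
    """Ridge-regularized self-expression of sample i over the enlarged
    dictionary, excluding the sample's own copies (analogous to the
    zero-diagonal constraint in sparse subspace clustering)."""
    keep = np.ones(D.shape[1], dtype=bool)
    keep[[i, i + n, i + 2 * n]] = False
    Di = D[:, keep]
    G = Di.T @ Di + lam * np.eye(Di.shape[1])
    c = np.zeros(D.shape[1])
    c[keep] = np.linalg.solve(G, Di.T @ X[:, i])
    return c

C = np.column_stack([self_express(i, X, D) for i in range(n)])  # (12, 4)

# Fold the coefficients of augmented atoms back onto their source samples
# and symmetrize, giving an affinity over the original samples that a
# spectral-clustering step would then consume.
W = sum(np.abs(C[k * n:(k + 1) * n, :]) for k in range(3))
A = W + W.T
print(A.round(3))  # within-subspace entries nonzero, cross-subspace ~0
```

Even in this tiny example, each sample is expressed using atoms from its own subspace only, so the resulting affinity is block-diagonal; the enlarged dictionary is what makes the representation possible when only a couple of original samples per subspace are available.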