When data contain heterogeneous subpopulations, classification performance benefits from incorporating knowledge of the cluster structure into the classifier. Previous methods for such combined clustering and classification either 1) are tied to a specific classifier and are not generic, or 2) perform clustering and classifier training independently, which may not yield clusters that benefit classifier performance. The question of how to cluster so as to improve the performance of classifiers trained on the resulting clusters has received scant attention in the literature, despite its importance in several real-world applications. In this paper, we first theoretically analyze the generalization performance of classifiers trained on clustered data and identify conditions under which clustering can aid classification. This analysis motivates a simple k-means-based classification algorithm called Clustering Aware Classification (CAC) and its neural variant, DeepCAC. DeepCAC leverages deep representation learning to learn latent embeddings and finds clusters in a manner that makes the clustered data suitable for training a classifier for each underlying subpopulation. Our experiments on synthetic and real benchmark datasets demonstrate the efficacy of DeepCAC over previous methods for combined clustering and classification.
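To make the setting concrete, the following is a minimal sketch of the second kind of prior approach named above: cluster first, then train one classifier per cluster, with the two steps fully independent. It is not CAC or DeepCAC, whose details the abstract does not give. The function names, the toy data, and the choice of a nearest-class-centroid rule as the per-cluster classifier are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def sq_dist(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[d] for p in points) / n for d in range(len(points[0])))


def kmeans(X, k, iters=20, seed=0):
    """Plain k-means (Lloyd's iterations); labels play no role here."""
    rng = random.Random(seed)
    centers = rng.sample(X, k)
    for _ in range(iters):
        groups = defaultdict(list)
        for x in X:
            j = min(range(k), key=lambda c: sq_dist(x, centers[c]))
            groups[j].append(x)
        # Keep the previous center if a cluster ends up empty.
        centers = [mean(groups[j]) if groups[j] else centers[j] for j in range(k)]
    return centers


def fit_cluster_then_classify(X, y, k, seed=0):
    """Step 1: cluster. Step 2: fit one classifier per cluster (hypothetical helper)."""
    centers = kmeans(X, k, seed=seed)
    by_cluster = defaultdict(lambda: defaultdict(list))
    for x, label in zip(X, y):
        j = min(range(k), key=lambda c: sq_dist(x, centers[c]))
        by_cluster[j][label].append(x)
    # Illustrative per-cluster classifier: one centroid per class within the cluster.
    models = {
        j: {label: mean(pts) for label, pts in classes.items()}
        for j, classes in by_cluster.items()
    }
    return centers, models


def predict(x, centers, models):
    """Route x to the nearest cluster that has a model, then use that cluster's classifier."""
    j = min(models, key=lambda c: sq_dist(x, centers[c]))
    class_means = models[j]
    return min(class_means, key=lambda label: sq_dist(x, class_means[label]))


# Toy data (assumed): two subpopulations whose label rule is flipped.
X = [(0.0, -1.0), (0.2, -1.1), (0.0, 1.0), (0.2, 1.1),
     (10.0, -1.0), (10.2, -1.1), (10.0, 1.0), (10.2, 1.1)]
y = [0, 0, 1, 1, 1, 1, 0, 0]

centers, models = fit_cluster_then_classify(X, y, k=2)
queries = [(0.1, -0.9), (0.1, 0.9), (10.1, -0.9), (10.1, 0.9)]
print([predict(q, centers, models) for q in queries])
```

Because the clustering step never sees the labels, nothing guarantees that the clusters it finds are the ones most useful to the classifiers. That gap is the motivation the abstract gives for choosing clusters with classification in mind.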