Multi-label datasets often contain a mixture of very frequent and very infrequent labels. This variation in label frequency, a type of class imbalance, creates a significant challenge for building efficient multi-label classification algorithms. In this paper, we tackle this problem by proposing a minority class oversampling scheme, UCLSO, which integrates Unsupervised Clustering and Label-Specific data Oversampling. Clustering is performed to identify the key distinct and locally connected regions of a multi-label dataset, irrespective of the label information. Next, for each label, we explore the distribution of minority points across the clusters. Synthetic minority points for oversampling are generated only from minority points that lie within the same cluster. Even though the cluster set is the same across all labels, the distribution of the synthetic minority points varies from label to label. The training dataset is augmented with the set of label-specific synthetic minority points, and classifiers are trained to predict the relevance of each label independently. Experiments using 12 multi-label datasets and several multi-label algorithms show that the proposed method performs well compared with the competing algorithms.
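The two stages described in the abstract, label-agnostic clustering followed by per-label oversampling within clusters, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract names neither the clustering algorithm nor the rule for generating synthetic points, so the sketch assumes plain k-means, SMOTE-style linear interpolation between two minority points of the same cluster, and cluster selection in proportion to each cluster's minority count. The function names `kmeans` and `oversample_label` are hypothetical, and the minority class of each label is assumed to be encoded as 1.

```python
import random


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means on feature vectors only; labels are never consulted.

    Assumption: the abstract does not name a clustering algorithm.
    """
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assign each point to its nearest center (squared Euclidean distance).
        for i, p in enumerate(points):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
        # Move each center to the mean of its members; empty clusters keep
        # their previous center.
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assign[i] == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def oversample_label(points, y, assign, n_new, seed=0):
    """Generate up to n_new synthetic minority points for a single label.

    y holds the 0/1 relevance of this label for each point (1 = minority).
    Assumptions: linear interpolation between two minority points of the
    same cluster, with clusters drawn in proportion to their minority count.
    """
    rng = random.Random(seed)
    # Group the indices of minority points by cluster.
    by_cluster = {}
    for i, (c, label) in enumerate(zip(assign, y)):
        if label == 1:
            by_cluster.setdefault(c, []).append(i)
    # Interpolation needs at least two minority points in a cluster.
    usable = [idx for idx in by_cluster.values() if len(idx) >= 2]
    if not usable:
        return []
    weights = [len(idx) for idx in usable]
    synthetic = []
    for _ in range(n_new):
        idx = rng.choices(usable, weights=weights)[0]
        i, j = rng.sample(idx, 2)
        t = rng.random()
        # New point on the segment between two same-cluster minority points.
        synthetic.append([a + t * (b - a) for a, b in zip(points[i], points[j])])
    return synthetic


# Toy data: six points, two labels. The clustering is shared across labels,
# while the synthetic points are generated separately for each label.
X = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [5.0, 5.0], [5.1, 5.2], [5.2, 4.9]]
Y = [[1, 0], [1, 0], [0, 0], [0, 1], [0, 1], [0, 0]]
assign = kmeans(X, k=2)
synthetic_per_label = {
    label: oversample_label(X, [row[label] for row in Y], assign, n_new=3, seed=label)
    for label in range(2)
}
```

Under these assumptions, each label's training set would be the original data plus `synthetic_per_label[label]` marked as positive for that label, with one binary classifier trained per label. Because interpolation never crosses a cluster boundary, synthetic points stay inside regions that already contain minority points for that label.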