This study addresses the induction of classifiers from imbalanced data, in which a minority class is under-represented relative to the majority classes. The first part of the research examines the main characteristics of data that give rise to this problem. Following a review of relevant previous research, a variety of artificial imbalanced data sets shaped by these key factors were created and used to build decision trees and rule-based classifiers. The second part investigates how classifiers can be improved by pre-processing the data with resampling approaches. The experiments compare the performance of distinct pre-processing resampling methods: two variants of random over-sampling and the focused under-sampling method NCR. The paper further addresses class imbalance with a new method called Sparsity, in which the data is made more sparse with respect to its class centers, making it more homogeneous.
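To make the idea of resampling as a pre-processing step concrete, the following is a minimal sketch of plain random over-sampling with replacement. It is an illustration under stated assumptions, not the procedure used in the study: the abstract does not specify its two over-sampling variants, and the function name `random_oversample`, the `seed` parameter, and the toy data are all hypothetical.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen examples of the smaller classes
    until every class is as large as the largest one.

    Hypothetical helper for illustration only; not the paper's method.
    """
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())  # size of the largest class

    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        # All original examples belonging to this class.
        pool = [x for x, lab in zip(X, y) if lab == label]
        # Draw with replacement to fill the gap (no draws for the largest class).
        extra = [rng.choice(pool) for _ in range(target - n)]
        X_out.extend(extra)
        y_out.extend([label] * len(extra))
    return X_out, y_out


# Toy data (made up): five majority examples, one minority example.
X = [[0.1], [0.2], [0.3], [0.4], [0.5], [2.0]]
y = [0, 0, 0, 0, 0, 1]

X_bal, y_bal = random_oversample(X, y)
print(Counter(y_bal))
```

Because the minority examples are only duplicated, this kind of over-sampling balances class counts without adding new information. Focused methods such as NCR instead remove selected majority examples.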