Imbalance in the proportion of training samples belonging to different classes often causes performance degradation in conventional classifiers. This is primarily due to the tendency of the classifier to be biased towards the majority classes in the imbalanced dataset. In this paper, we propose a novel three-step technique to address imbalanced data. In the first step, we significantly oversample the minority class distribution by employing the traditional Synthetic Minority Oversampling Technique (SMOTE) algorithm on the neighborhood of the minority class samples. In the second step, we partition the generated samples using a Gaussian-Mixture-Model-based clustering algorithm. In the final step, synthetic data samples are chosen based on the weight associated with each cluster, the weight itself being determined by the distribution of the majority class samples. Extensive experiments on several standard datasets from diverse domains show the usefulness of the proposed technique in comparison with the original SMOTE and its state-of-the-art variants.
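The three-step pipeline described above can be sketched as follows. This is a minimal, illustrative numpy implementation under stated assumptions: a simple k-means routine stands in for the paper's Gaussian-Mixture-Model clustering, the distance-based cluster weighting is only one plausible reading of "determined by the distribution of the majority class samples", and all function names are hypothetical, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def smote_oversample(X_min, n_new, k=3):
    """Step 1: SMOTE-style interpolation between each minority sample
    and one of its k nearest minority-class neighbours."""
    n = len(X_min)
    d = np.linalg.norm(X_min[:, None] - X_min[None, :], axis=-1)
    np.fill_diagonal(d, np.inf)                 # exclude self-matches
    nn = np.argsort(d, axis=1)[:, :k]           # k nearest neighbours per sample
    base = rng.integers(0, n, size=n_new)       # random base samples
    nbr = nn[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))                # interpolation factor in [0, 1]
    return X_min[base] + gap * (X_min[nbr] - X_min[base])

def cluster(X, n_clusters, iters=50):
    """Step 2: partition the synthetic samples. A basic k-means loop is
    used here as a lightweight stand-in for GMM clustering."""
    C = X[rng.choice(len(X), n_clusters, replace=False)]
    labels = np.zeros(len(X), dtype=int)
    for _ in range(iters):
        labels = np.argmin(np.linalg.norm(X[:, None] - C[None], axis=-1), axis=1)
        for j in range(n_clusters):
            if np.any(labels == j):
                C[j] = X[labels == j].mean(axis=0)
    return labels, C

def weighted_select(X_syn, labels, centroids, X_maj, n_select):
    """Step 3 (one plausible weighting): clusters whose centroids lie
    farther from the majority-class mass receive higher weight."""
    w = np.array([np.linalg.norm(X_maj - c, axis=1).mean() for c in centroids])
    w = w / w.sum()
    counts = rng.multinomial(n_select, w)       # samples drawn per cluster
    chosen = []
    for j, cnt in enumerate(counts):
        idx = np.flatnonzero(labels == j)
        if cnt and len(idx):
            chosen.append(X_syn[rng.choice(idx, size=cnt, replace=True)])
    return np.vstack(chosen) if chosen else X_syn[:0]

# Toy imbalanced data: 20 minority vs. 200 majority samples in 2-D.
X_min = rng.normal(0.0, 1.0, (20, 2))
X_maj = rng.normal(3.0, 1.0, (200, 2))

X_syn = smote_oversample(X_min, n_new=200)          # step 1
labels, centroids = cluster(X_syn, n_clusters=4)    # step 2
X_sel = weighted_select(X_syn, labels, centroids, X_maj, n_select=60)  # step 3
```

The selected samples `X_sel` would then be appended to the minority class before training; the key design choice the abstract highlights is that selection is not uniform over the synthetic pool but weighted per cluster according to where the majority class lies.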