For many years, class imbalance has been one of the major obstacles in solving classification problems. Because most machine learning algorithms assume by default that the classes are balanced, they do not take the class distribution of the samples into account, and the results tend to be unsatisfactory and skewed towards the majority class. Consequently, a model built on imbalanced data without handling the imbalance could be misleading both in practice and in theory. Most researchers have applied the Synthetic Minority Oversampling Technique (SMOTE) and the Adaptive Synthetic (ADASYN) sampling approach independently in their work to handle data imbalance, and have not clearly explained the algorithms behind these techniques with computed examples. This paper focuses on both synthetic oversampling techniques and manually computes synthetic data points to make the algorithms easier to understand. We analyze the application of these synthetic oversampling techniques to binary classification problems with different imbalance ratios and sample sizes.
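To make the idea of manually computing synthetic points concrete, the following is a minimal sketch, not the paper's own worked example. It assumes the standard formulations: SMOTE places a new point on the line segment between a minority sample and one of its minority-class nearest neighbours, and ADASYN allocates more synthetic points to minority samples whose neighbourhoods contain more majority samples. The toy coordinates, the fixed gap of 0.5, and the helper names `nearest_neighbors`, `smote_point`, and `adasyn_counts` are all hypothetical and chosen only for illustration.

```python
import math

def nearest_neighbors(point, candidates, k):
    """Return the k candidates closest to `point` (Euclidean), excluding the point itself."""
    others = [c for c in candidates if c != point]
    return sorted(others, key=lambda c: math.dist(point, c))[:k]

def smote_point(x, neighbor, gap):
    """SMOTE interpolation: x + gap * (neighbor - x), with gap in [0, 1]."""
    return tuple(xi + gap * (ni - xi) for xi, ni in zip(x, neighbor))

def adasyn_counts(minority, majority, k, total_new):
    """ADASYN allocation: share `total_new` points among minority samples
    in proportion to the fraction of majority points among their k neighbours."""
    labelled = [(p, 0) for p in minority] + [(p, 1) for p in majority]
    ratios = []
    for i, x in enumerate(minority):
        others = labelled[:i] + labelled[i + 1:]  # drop the sample itself
        nearest = sorted(others, key=lambda pl: math.dist(x, pl[0]))[:k]
        ratios.append(sum(label for _, label in nearest) / k)
    total = sum(ratios)
    if total == 0:  # no minority sample has majority neighbours
        return [0] * len(minority)
    return [round(total_new * r / total) for r in ratios]

# Toy data (assumed for illustration only).
minority = [(1.0, 1.0), (2.0, 1.0), (5.0, 5.0)]
majority = [(5.0, 6.0), (6.0, 5.0), (6.0, 6.0), (0.0, 8.0)]

# SMOTE: interpolate halfway between (1, 1) and its nearest minority neighbour.
neighbor = nearest_neighbors(minority[0], minority, k=1)[0]
synthetic = smote_point(minority[0], neighbor, gap=0.5)

# ADASYN: decide how many of 4 new points each minority sample should generate.
counts = adasyn_counts(minority, majority, k=2, total_new=4)

print(synthetic)  # (1.5, 1.0)
print(counts)     # [0, 0, 4]
```

In this toy setting ADASYN assigns all four new points to (5, 5), the only minority sample surrounded by majority points, whereas plain SMOTE would treat every minority sample alike. In practice the gap is drawn at random from [0, 1] rather than fixed.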