Data imbalance remains one of the most widespread problems affecting contemporary machine learning. The negative effect that data imbalance can have on traditional learning algorithms is most severe in combination with other dataset difficulty factors, such as small disjuncts, the presence of outliers, and an insufficient number of training observations. These difficulty factors can also limit the applicability of some methods for dealing with data imbalance, in particular the neighborhood-based oversampling algorithms based on SMOTE. Radial-Based Oversampling (RBO) was previously proposed to mitigate some of the limitations of the neighborhood-based methods. In this paper we examine the possibility of utilizing the concept of mutual class potential, used to guide the oversampling process in RBO, in the undersampling procedure. The conducted computational complexity analysis indicates a significantly reduced time complexity of the proposed Radial-Based Undersampling algorithm, and the results of the experimental study indicate its usefulness, especially on difficult datasets.
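The abstract does not define mutual class potential or spell out the undersampling procedure, so the following is only a rough sketch under stated assumptions. It assumes the potential at a point is a sum of Gaussian radial basis functions centred on majority observations minus the same sum over minority observations, and that undersampling greedily removes the majority observation with the highest potential until the classes are balanced. The function names, the `gamma` spread parameter, and the `n_keep` argument are hypothetical and chosen for illustration.

```python
import numpy as np


def mutual_class_potential(points, majority, minority, gamma=1.0):
    """Assumed definition: Gaussian RBF mass of the majority class
    minus that of the minority class, evaluated at each row of `points`."""
    def rbf_sum(centres):
        # Pairwise squared distances, shape (n_points, n_centres).
        d2 = ((points[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
        return np.exp(-d2 / gamma ** 2).sum(axis=1)

    return rbf_sum(majority) - rbf_sum(minority)


def radial_undersample(majority, minority, gamma=1.0, n_keep=None):
    """Greedy sketch: repeatedly drop the majority observation with the
    highest assumed potential until `n_keep` majority observations remain."""
    n_keep = len(minority) if n_keep is None else n_keep
    keep = np.arange(len(majority))
    potential = mutual_class_potential(majority, majority, minority, gamma)

    while len(keep) > n_keep:
        worst = np.argmax(potential[keep])  # position within `keep`
        removed = majority[keep[worst]]
        keep = np.delete(keep, worst)
        # The removed observation no longer contributes majority mass,
        # so subtract its RBF term from the remaining observations.
        d2 = ((majority[keep] - removed) ** 2).sum(axis=1)
        potential[keep] -= np.exp(-d2 / gamma ** 2)

    return majority[keep]
```

Under these assumptions, the potential is computed once and then updated incrementally after each removal, which avoids recomputing all pairwise terms in every iteration. Whether this matches the complexity analysis referred to in the abstract cannot be confirmed from the abstract alone.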