Data imbalance remains one of the factors that degrade the performance of contemporary machine learning algorithms. One of the most common ways of reducing its negative impact is to preprocess the original dataset with data-level strategies. In this paper, we propose a unified framework for over- and undersampling of imbalanced data. The proposed approach uses radial basis functions to preserve the original shape of the underlying class distributions during resampling. This is done by optimizing the positions of the generated synthetic observations with respect to the potential resemblance loss. The final Potential Anchoring algorithm combines over- and undersampling within the proposed framework. The results of experiments conducted on 60 imbalanced datasets show that Potential Anchoring outperforms state-of-the-art resampling algorithms, including previously proposed methods that use radial basis functions to model class potential. Furthermore, an analysis based on the proposed data complexity index shows that Potential Anchoring is particularly well suited to handling naturally complex (i.e., not affected by the presence of noise) datasets.
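To make the idea of optimizing observation positions against a potential resemblance loss more concrete, the following is a minimal sketch of the undersampling direction only. It is not the authors' implementation: the Gaussian form of the kernel, the mean-squared form of the loss, the use of the original observations as anchor points, the plain gradient descent, and all names (`rbf_potential`, `resemblance_loss`, `loss_gradient`, `fit_prototypes`, `gamma`, `lr`) are assumptions made for illustration.

```python
import numpy as np


def rbf_potential(points, observations, gamma):
    """Mean Gaussian RBF potential of `observations`, evaluated at each row of `points`."""
    sq_dists = np.sum((points[:, None, :] - observations[None, :, :]) ** 2, axis=-1)
    return np.exp(-sq_dists / gamma**2).mean(axis=1)


def resemblance_loss(anchors, original, resampled, gamma):
    """Assumed loss: mean squared gap between the two potentials at the anchors."""
    gap = rbf_potential(anchors, original, gamma) - rbf_potential(anchors, resampled, gamma)
    return float(np.mean(gap**2))


def loss_gradient(anchors, original, resampled, gamma):
    """Analytical gradient of `resemblance_loss` with respect to `resampled`."""
    diffs = anchors[:, None, :] - resampled[None, :, :]      # shape (k, m, d)
    kernel = np.exp(-np.sum(diffs**2, axis=-1) / gamma**2)    # shape (k, m)
    # Potential of the resampled set minus potential of the original set.
    gap = kernel.mean(axis=1) - rbf_potential(anchors, original, gamma)
    k, m = kernel.shape
    weights = (2.0 / k) * gap[:, None] * kernel * (2.0 / (m * gamma**2))
    return np.einsum("km,kmd->md", weights, diffs)


def fit_prototypes(original, n_prototypes, gamma=1.0, lr=0.5, n_iter=200, seed=0):
    """Shrink `original` to `n_prototypes` points whose potential resembles the original."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(original), size=n_prototypes, replace=False)
    current = original[idx].copy()
    # Keep the best positions seen so far, so the result is never worse than the start.
    best, best_loss = current.copy(), resemblance_loss(original, original, current, gamma)
    for _ in range(n_iter):
        current = current - lr * loss_gradient(original, original, current, gamma)
        loss = resemblance_loss(original, original, current, gamma)
        if loss < best_loss:
            best, best_loss = current.copy(), loss
    return best, best_loss


# Toy majority class: 200 two-dimensional observations reduced to 20 prototypes.
majority = np.random.default_rng(42).normal(size=(200, 2))
prototypes, final_loss = fit_prototypes(majority, n_prototypes=20)
```

In this sketch the prototypes start as a random subset of the class and are then moved so that the potential they induce stays close to the potential of the full class. Oversampling would presumably follow the same pattern with more points than the original class contains, but how the paper initializes and constrains those points is not stated in the abstract.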