Synthetic oversampling of minority examples using SMOTE and its variants is a leading strategy for addressing imbalanced classification problems. Despite the practical success of this approach, its theoretical foundations remain underexplored. We develop a theoretical framework to analyze the behavior of SMOTE and related methods when classifiers are trained on synthetic data. We first derive a uniform concentration bound on the discrepancy between the empirical risk over synthetic minority samples and the population risk under the true minority distribution. We then provide a nonparametric excess risk guarantee for kernel-based classifiers trained on such synthetic data. These results yield practical guidelines for better parameter tuning of both SMOTE and the downstream learning algorithm. Numerical experiments are provided to illustrate and support the theoretical findings.
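To make the setting concrete, the following is a minimal Python sketch of the procedure analyzed here: SMOTE-style oversampling (convex combinations of a minority sample and one of its k nearest minority neighbors), followed by training a kernel-based classifier on the augmented data. The function names, parameters, and the choice of an RBF-kernel SVM are illustrative assumptions, not the paper's implementation.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors
from sklearn.svm import SVC

def smote_oversample(X_min, n_synthetic, k=5, rng=None):
    """Generate synthetic minority points by SMOTE-style interpolation:
    each new point lies on the segment between a minority sample and one
    of its k nearest minority neighbors (illustrative sketch)."""
    rng = np.random.default_rng(rng)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
    _, idx = nn.kneighbors(X_min)            # idx[:, 0] is the point itself
    synthetic = np.empty((n_synthetic, X_min.shape[1]))
    for i in range(n_synthetic):
        j = rng.integers(len(X_min))          # pick a minority sample
        nbr = X_min[rng.choice(idx[j, 1:])]   # pick one of its k neighbors
        lam = rng.random()                    # interpolation weight in [0, 1]
        synthetic[i] = X_min[j] + lam * (nbr - X_min[j])
    return synthetic

# Toy imbalanced data (assumed for illustration): 500 majority vs. 25 minority.
rng = np.random.default_rng(0)
X_maj = rng.normal(0.0, 1.0, size=(500, 2))
X_min = rng.normal(2.0, 0.5, size=(25, 2))
X_syn = smote_oversample(X_min, n_synthetic=475, k=5, rng=0)

# Train a kernel classifier (RBF-kernel SVM) on the rebalanced sample.
X = np.vstack([X_maj, X_min, X_syn])
y = np.concatenate([np.zeros(len(X_maj)), np.ones(len(X_min) + len(X_syn))])
clf = SVC(kernel="rbf", gamma="scale").fit(X, y)
```

In this sketch, k plays the role of the SMOTE neighborhood parameter and the kernel bandwidth (here sklearn's `gamma`) that of the downstream classifier's tuning parameter, the two quantities whose joint choice the theoretical guidelines address.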