We aim to demonstrate experimentally that our cost-sensitive PEGASOS SVM achieves good performance on imbalanced datasets with majority-to-minority ratios ranging from 8.6:1 to 130:1, and to ascertain whether including an intercept (bias), regularization, and other hyperparameter choices affect performance on our selection of datasets. Although many practitioners resort to SMOTE-style oversampling, we aim for a less computationally intensive method. We evaluate performance by examining learning curves, which diagnose whether we overfit or underfit and whether the chosen training/test data are over- or under-representative. We also examine validation curves, which plot training and test error against each hyperparameter. We benchmark our cost-sensitive PEGASOS SVM against the results of Ding's LINEAR SVM DECIDL method, which obtained an ROC-AUC of 0.5 on one dataset. Our work extends Ding's by incorporating kernels into the SVM. We use Python rather than MATLAB because Python's dictionaries can store mixed data types during multi-parameter cross-validation.
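For concreteness, the sketch below shows one way a cost-sensitive PEGASOS update with an optional intercept can be written. The function name, the `cost_pos`/`cost_neg` class weighting, and the convention of excluding the bias from the L2 penalty are our illustrative assumptions, not a specification fixed by this section.

```python
import numpy as np

def pegasos_cost_sensitive(X, y, lam=1e-4, n_iters=100_000,
                           cost_pos=1.0, cost_neg=1.0, use_bias=True,
                           rng=None):
    """Stochastic sub-gradient PEGASOS with class-dependent hinge costs.

    A minimal sketch: labels y must lie in {-1, +1}; cost_pos/cost_neg
    weight margin violations on the positive/negative class. The bias is
    updated alongside w but left out of the L2 penalty (one common
    convention; variants differ).
    """
    rng = np.random.default_rng(rng)
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for t in range(1, n_iters + 1):
        i = rng.integers(n)                         # sample one example
        eta = 1.0 / (lam * t)                       # PEGASOS step size
        margin = y[i] * (X[i] @ w + (b if use_bias else 0.0))
        w *= (1.0 - eta * lam)                      # shrink from L2 penalty
        if margin < 1.0:                            # hinge-loss violation
            c = cost_pos if y[i] > 0 else cost_neg  # class-dependent cost
            w += eta * c * y[i] * X[i]
            if use_bias:
                b += eta * c * y[i]
    return w, b
```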
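The following is a minimal sketch of the two diagnostics and the dictionary-based bookkeeping described above, assuming scikit-learn is available and using its SGDClassifier with hinge loss (an L2-regularized linear SVM trained by SGD) as a stand-in for our PEGASOS solver; the synthetic imbalanced dataset is purely illustrative.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import learning_curve, validation_curve

# Illustrative imbalanced dataset (~11:1 majority-to-minority ratio).
X, y = make_classification(n_samples=2000, n_features=20,
                           weights=[0.92], random_state=0)

# Hinge-loss SGD classifier with class-weighted (cost-sensitive) updates.
svm = SGDClassifier(loss="hinge", penalty="l2", class_weight="balanced")

# Learning curve: ROC-AUC vs. training-set size, to diagnose over/underfitting
# and unrepresentative training/test splits.
sizes, train_scores, test_scores = learning_curve(
    svm, X, y, train_sizes=np.linspace(0.1, 1.0, 5),
    cv=5, scoring="roc_auc")

# Validation curve: ROC-AUC vs. one hyperparameter (here the regularization
# strength alpha).
param_range = np.logspace(-6, -1, 6)
train_scores_v, test_scores_v = validation_curve(
    svm, X, y, param_name="alpha", param_range=param_range,
    cv=5, scoring="roc_auc")

# Dictionaries hold mixed-type results per hyperparameter setting -- the
# convenience that motivates choosing Python over MATLAB here.
results = {
    alpha: {"mean_train_auc": tr.mean(), "mean_test_auc": te.mean()}
    for alpha, tr, te in zip(param_range, train_scores_v, test_scores_v)
}
```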