Imbalanced data is a frequently encountered problem in machine learning. Despite a vast literature on sampling techniques for imbalanced data, few studies address the question of the optimal sampling ratio. In this paper, we attempt to fill this gap by conducting a large-scale study of the effect of the sampling ratio on classification accuracy. We consider 10 popular sampling methods and evaluate their performance over a range of ratios on 20 datasets. The results of the numerical experiments suggest that the optimal sampling ratio lies between 0.7 and 0.8, although the exact ratio varies with the dataset. Furthermore, we find that while factors such as the original imbalance ratio or the number of features do not play a discernible role in determining the optimal ratio, the number of samples in the dataset may have a tangible effect.
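To make the notion of resampling to a target ratio concrete, the following is a minimal sketch of random oversampling over a small grid of ratios. It is an illustration under stated assumptions, not the procedure used in the study: the sampling ratio is assumed here to mean the minority-to-majority class size after resampling, and the function name `oversample_to_ratio`, the toy labels, and the ratio grid are all hypothetical.

```python
import random
from collections import Counter


def oversample_to_ratio(labels, ratio, minority=1, seed=0):
    """Return sample indices after random oversampling of the minority class.

    Assumption: `ratio` is the desired n_minority / n_majority after
    resampling, so ratio=1.0 corresponds to full balancing.
    """
    if not 0 < ratio <= 1:
        raise ValueError("ratio must be in (0, 1]")
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority]
    majority_idx = [i for i, y in enumerate(labels) if y != minority]
    if not minority_idx or not majority_idx:
        raise ValueError("both classes must be present")
    # Number of minority samples needed to reach the target ratio.
    target = round(ratio * len(majority_idx))
    # Never remove samples: if the data already meets the ratio, add nothing.
    n_extra = max(0, target - len(minority_idx))
    # Duplicate randomly chosen minority samples (sampling with replacement).
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]
    return majority_idx + minority_idx + extra


if __name__ == "__main__":
    # Toy labels: 100 majority (0) and 10 minority (1) samples.
    labels = [0] * 100 + [1] * 10
    for ratio in (0.5, 0.75, 1.0):
        idx = oversample_to_ratio(labels, ratio)
        counts = Counter(labels[i] for i in idx)
        print(ratio, counts[1] / counts[0])
```

In an experiment of the kind described above, each resampled training set would be passed to a classifier and scored on an untouched test set, so that accuracy can be compared across ratios.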