Imbalanced data arises in a wide range of scenarios. A skewed distribution of the target variable induces bias in machine learning algorithms. One popular method for combating imbalanced data is to artificially balance it through resampling. In this paper, we evaluate the efficacy of a recently proposed kernel density estimation (KDE) sampling technique in the context of artificial neural networks. We benchmark the KDE sampling method against two baseline sampling techniques and perform comparative experiments using 8 datasets and 3 neural network architectures. The results show that KDE sampling produces the best performance on 6 of the 8 datasets, although it must be used with caution on image datasets. We conclude that KDE sampling can significantly improve the performance of neural networks.
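To make the idea concrete, the following is a minimal sketch of KDE-based oversampling, not the authors' exact method: a Gaussian KDE is fitted to the minority class and synthetic points are drawn from it until the classes are balanced. The use of scikit-learn's `KernelDensity` and the fixed bandwidth are assumptions for illustration; in practice the bandwidth would be tuned.

```python
import numpy as np
from sklearn.neighbors import KernelDensity

def kde_oversample(X, y, minority_label, bandwidth=0.5, random_state=0):
    """Balance a binary dataset by drawing synthetic minority-class
    points from a KDE fitted to the minority class.
    Illustrative sketch only; bandwidth is a hypothetical default."""
    X_min = X[y == minority_label]
    # number of synthetic points needed to equalize the two classes
    n_needed = int((y != minority_label).sum() - len(X_min))
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth).fit(X_min)
    X_new = kde.sample(n_samples=n_needed, random_state=random_state)
    X_bal = np.vstack([X, X_new])
    y_bal = np.concatenate([y, np.full(n_needed, minority_label)])
    return X_bal, y_bal

# toy example: 90 majority points vs 10 minority points
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, size=(90, 2)),
               rng.normal(3, 1, size=(10, 2))])
y = np.array([0] * 90 + [1] * 10)
X_bal, y_bal = kde_oversample(X, y, minority_label=1)
print((y_bal == 0).sum(), (y_bal == 1).sum())  # classes are now equal in size
```

Because the synthetic points are sampled from a smoothed density rather than interpolated between existing observations, this approach can populate low-density regions of the minority class, which is the intuition behind KDE sampling.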