Differences in data size per class, also known as imbalanced data distribution, have become a common problem affecting data quality. Big Data scenarios pose a new challenge to traditional imbalanced classification algorithms, since they are not prepared to work with such amounts of data. Data-splitting strategies and the resulting lack of data in the minority class, both consequences of using the MapReduce paradigm, pose further challenges for tackling class imbalance in Big Data scenarios. Ensembles have been shown to address imbalanced data problems successfully. Smart Data refers to data of sufficient quality to yield high-performance models. Combining ensembles with Smart Data, obtained through Big Data preprocessing, should yield a strong synergy. In this paper, we propose a novel Smart Data driven Decision Trees Ensemble methodology, named SD_DeTE, for addressing the imbalanced classification problem in Big Data domains. The methodology learns different decision trees from distributed quality data for the ensemble process. This quality data is obtained by fusing Random Discretization, Principal Components Analysis and clustering-based Random Oversampling to produce different Smart Data versions of the original data. Experiments carried out on 21 binary adapted datasets show that our methodology outperforms Random Forest.
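As a rough illustration of how several Smart Data versions might be derived from one imbalanced dataset, the following Python sketch repeatedly applies a random discretization step followed by a random oversampling step. It is not the SD_DeTE implementation. The Principal Components Analysis step, the clustering that guides oversampling, the distributed execution and the decision tree learning are all omitted, and plain random oversampling is swapped in for the clustering-based variant. The function names and parameters (`random_discretize`, `random_oversample`, `smart_data_versions`, `n_bins`, `n_versions`, `seed`) are hypothetical.

```python
import random
from bisect import bisect_right
from collections import Counter


def random_discretize(rows, n_bins, rng):
    """Discretize each feature using cut points sampled from its observed values.

    Assumes n_bins >= 2 and that every row has the same number of features.
    """
    n_features = len(rows[0])
    cuts = []
    for j in range(n_features):
        values = sorted({row[j] for row in rows})
        k = min(n_bins - 1, len(values))  # at most n_bins - 1 cut points
        cuts.append(sorted(rng.sample(values, k)))
    # bisect_right returns the index of the interval a value falls into
    return [
        [bisect_right(cuts[j], row[j]) for j in range(n_features)]
        for row in rows
    ]


def random_oversample(rows, labels, rng):
    """Duplicate randomly chosen rows until every class matches the largest one.

    Simplification: the abstract describes a clustering-based variant,
    which is not reproduced here.
    """
    counts = Counter(labels)
    target = max(counts.values())
    out_rows, out_labels = list(rows), list(labels)
    for cls, count in counts.items():
        pool = [r for r, y in zip(rows, labels) if y == cls]
        for _ in range(target - count):
            out_rows.append(rng.choice(pool))
            out_labels.append(cls)
    return out_rows, out_labels


def smart_data_versions(rows, labels, n_versions=3, n_bins=4, seed=0):
    """Build several differently discretized, rebalanced copies of the data."""
    rng = random.Random(seed)
    versions = []
    for _ in range(n_versions):
        binned = random_discretize(rows, n_bins, rng)
        versions.append(random_oversample(binned, labels, rng))
    return versions


if __name__ == "__main__":
    # Toy data: five majority-class rows (label 0), one minority row (label 1).
    rows = [[0.1, 5.0], [0.4, 3.0], [0.5, 4.5], [0.9, 1.0], [0.7, 2.0], [0.2, 4.0]]
    labels = [0, 0, 0, 0, 0, 1]
    for version_rows, version_labels in smart_data_versions(rows, labels):
        print(Counter(version_labels), version_rows[:2])
```

In an ensemble setting, each version would be passed to a separate base learner and the predictions combined by voting. Because the cut points are drawn at random, the versions differ from one another, which is one plausible source of the diversity an ensemble relies on.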