Differences in the number of examples per class, also known as imbalanced data distribution, have become a common problem affecting data quality. Big Data scenarios pose a new challenge to traditional imbalanced classification algorithms, which were not designed to work with such volumes of data. Data-splitting strategies and the resulting lack of minority-class data under the MapReduce paradigm pose further challenges for tackling the imbalance between classes in Big Data scenarios. Ensembles have been shown to address imbalanced data problems successfully. Smart Data refers to data of sufficient quality to yield high-performance models. Combining ensembles with Smart Data, obtained through Big Data preprocessing, should therefore be highly synergistic. In this paper, we propose a novel methodology, named DeTE_SD, based on a Decision Trees Ensemble with Smart Data, for addressing the imbalanced classification problem in Big Data domains. The methodology learns different decision trees from distributed quality data for the ensemble process. This quality data is obtained by fusing Random Discretization, Principal Components Analysis and clustering-based Random Oversampling to produce different Smart Data versions of the original data. Experiments carried out on 21 datasets adapted to binary classification show that our methodology outperforms Random Forest.
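The core idea, that each ensemble member learns from a different preprocessed and rebalanced version of the data, can be illustrated with a minimal single-machine sketch. It is not the DeTE_SD implementation and makes several simplifying assumptions: Principal Components Analysis is omitted, plain random oversampling stands in for the clustering-based variant, a one-split decision stump stands in for a full decision tree, and nothing is distributed. All function names (`random_cut_points`, `discretize`, `random_oversample`, `fit_stump`, `fit_ensemble`, `predict`) and the toy data are hypothetical.

```python
import bisect
import random
from collections import Counter


def random_cut_points(X, n_bins, rng):
    """For each feature, draw n_bins - 1 cut points at random from its observed values."""
    cuts = []
    for j in range(len(X[0])):
        column = [row[j] for row in X]
        cuts.append(sorted(rng.sample(column, n_bins - 1)))
    return cuts


def discretize(X, cuts):
    """Replace each value by the index of the interval it falls into."""
    return [[bisect.bisect_right(cuts[j], v) for j, v in enumerate(row)] for row in X]


def random_oversample(X, y, rng):
    """Duplicate randomly chosen rows of the smaller classes until all classes match the largest."""
    by_class = {}
    for row, label in zip(X, y):
        by_class.setdefault(label, []).append(row)
    target = max(len(rows) for rows in by_class.values())
    X_out, y_out = [], []
    for label, rows in by_class.items():
        extra = [rng.choice(rows) for _ in range(target - len(rows))]
        X_out.extend(rows + extra)
        y_out.extend([label] * target)
    return X_out, y_out


def fit_stump(X, y):
    """Fit a one-split tree: the (feature, threshold) pair with the fewest training errors."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = Counter(l for row, l in zip(X, y) if row[j] <= t)
            right = Counter(l for row, l in zip(X, y) if row[j] > t)
            left_label = left.most_common(1)[0][0]
            right_label = right.most_common(1)[0][0] if right else left_label
            errors = len(y) - left[left_label] - right[right_label]
            if best is None or errors < best[0]:
                best = (errors, j, t, left_label, right_label)
    return best[1:]


def fit_ensemble(X, y, n_members=5, n_bins=4, seed=0):
    """Train each member on its own randomly discretized, rebalanced version of the data."""
    rng = random.Random(seed)
    members = []
    for _ in range(n_members):
        cuts = random_cut_points(X, n_bins, rng)
        X_smart, y_smart = random_oversample(discretize(X, cuts), y, rng)
        members.append((cuts, fit_stump(X_smart, y_smart)))
    return members


def predict(members, X):
    """Majority vote over the members, each applying its own cut points first."""
    votes = []
    for cuts, (j, t, left_label, right_label) in members:
        X_disc = discretize(X, cuts)
        votes.append([left_label if row[j] <= t else right_label for row in X_disc])
    return [Counter(column).most_common(1)[0][0] for column in zip(*votes)]


# Toy imbalanced data: 20 majority examples (class 0) and 3 minority examples (class 1).
X_train = [[float(i), float(i % 5)] for i in range(20)] + [[30.0, 9.0], [31.0, 8.0], [32.0, 9.5]]
y_train = [0] * 20 + [1] * 3
members = fit_ensemble(X_train, y_train)
X_test = [[3.0, 1.0], [31.5, 9.0]]
preds = predict(members, X_test)
print(preds)
```

Each member keeps its own cut points so that unseen examples are discretized exactly as that member's training data was. Using an odd number of members avoids tied votes in the binary case.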