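Research on Random Forest Algorithm Theory and Optimization Methods for Massive Ultra-High-Dimensional Data

Project title: Research on Random Forest Algorithm Theory and Optimization Methods for Massive Ultra-High-Dimensional Data

Project number: No.61203294

Project type: Young Scientists Fund

Approval year: 2013

Discipline: Automation

Principal investigator: Li Junjie (李俊杰)

Institution: Shenzhen University (深圳大学)

Funding: CNY 250,000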

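Abstract (translated from Chinese): Large-scale, ultra-high-dimensional data with thousands upon thousands of attributes pose an unprecedented challenge to existing classification algorithms. Commonly used algorithms achieve low accuracy on ultra-high-dimensional data and cannot handle big data. Many studies have shown that the random forest classification algorithm outperforms other classification algorithms on high-dimensional data, but it still faces a major bottleneck when used to build classification models for TB-scale ultra-high-dimensional data. Building on earlier research results, this project further studies random forest techniques for big data classification from two angles: theory and algorithm optimization. The research covers: 1) proving theoretically that, for ultra-high-dimensional data, the accuracy of the attribute-weighted subspace sampling random forest algorithm is no lower than that of Breiman's random forest algorithm, thereby enriching the theory of random forest algorithms; 2) to address the complexity and diversity of data attributes, studying a hybrid random forest optimization method that applies several decision tree algorithms together, and a dynamic interactive random forest optimization method, to remedy the shortcomings of the current practice of using a single decision tree algorithm; 3) to address large data volume, developing a highly scalable random forest algorithm and experimental system based on the MapReduce programming model, breaking through the technical bottleneck of TB-scale big data classification. The expected results will provide new theory and practical tools for big data classification.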

Keywords (translated from Chinese): random forest algorithm; massive data mining; data classification; machine learning

Abstract (original English): Ultra-high-dimensional, large-scale data with thousands upon thousands of features pose a new challenge to classification algorithms. Most current classification algorithms achieve low accuracy on ultra-high-dimensional data and cannot process large-scale data. Many studies have shown that the random forest algorithm outperforms other classification algorithms on high-dimensional data, but it still faces a bottleneck in processing TB-scale ultra-high-dimensional data. Building on our preliminary work, this project will further develop random forest theory and optimize the algorithm to process larger data. The major tasks of this project include: 1) Prove that the accuracy of the weighted subspace sampling random forest algorithm is no lower than that of Breiman's approach on ultra-high-dimensional data. The proof will enrich the theory of random forest algorithms. 2) For the problem of complex data, design a hybrid random forest algorithm, which builds multiple decision trees simultaneously with different decision tree algorithms, and an interactive random forest optimization method, to reduce the shortcomings of random forests built with a single decision tree algorithm. 3) For the problem of large-scale data, design a scalable MapReduce random forest algorithm and experimental p

Keywords (original English): Random Forest Algorithm; Massive Data Mining; Data Classification; Machine Learning
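
Illustrative sketch: the abstract names attribute-weighted subspace sampling but does not say how the weights are computed or how attributes are drawn. The snippet below is only a minimal sketch of the general idea, assuming per-attribute weights are already available (for example from some relevance score) and that each tree draws its candidate attributes without replacement with probability favouring larger weights. The function name `sample_weighted_subspace`, the weight values, and the use of Efraimidis-Spirakis sampling keys are assumptions for illustration, not the project's method.

```python
import random


def sample_weighted_subspace(weights, k, rng):
    """Draw k distinct attribute indices, favouring larger weights.

    Hypothetical helper: weighted sampling without replacement using
    Efraimidis-Spirakis keys (key = u ** (1 / w), keep the k largest).
    """
    if not 0 < k <= len(weights):
        raise ValueError("k must be between 1 and the number of attributes")
    keyed = []
    for index, weight in enumerate(weights):
        if weight <= 0:
            continue  # attributes with non-positive weight are never drawn
        u = 1.0 - rng.random()  # uniform in (0, 1], so u is never zero
        keyed.append((u ** (1.0 / weight), index))
    if len(keyed) < k:
        raise ValueError("fewer than k attributes have positive weight")
    keyed.sort(reverse=True)  # largest keys first
    return sorted(index for _, index in keyed[:k])


if __name__ == "__main__":
    rng = random.Random(0)
    # Assumed toy setting: 1000 attributes, the first 5 far more relevant.
    weights = [5.0] * 5 + [0.01] * 995
    # Each tree in the forest would get its own attribute subspace.
    for tree_id in range(3):
        print(tree_id, sample_weighted_subspace(weights, 10, rng))
```

With uniform weights this reduces to plain random subspace selection; with skewed weights, the few informative attributes of an ultra-high-dimensional data set are much more likely to appear in each tree's subspace.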
