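Project title: Research on Asymptotic Ensemble Learning Methods and Distributed Algorithms for Big Data
Project number: No.61473194
Project type: General Program
Approval year: 2015
Discipline: Other
Project author: Huang Zhexue (黄哲学)
Affiliation: Shenzhen University
Funding: CNY 800,000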
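Chinese abstract: One of the challenges in research on big data analysis algorithms is the data scalability of distributed algorithms. To address this challenge, this project proposes an asymptotic ensemble learning strategy. Under limited memory and computing resources, the strategy builds an ensemble learning model by computing on portions of the data step by step in batches, which improves the ability of distributed algorithms to process big data and enables them to solve TB-scale classification problems. The goal of the project is to propose new theory, methods, frameworks, and implementation techniques for highly scalable distributed analysis algorithms for big data. The main research topics are: (1) asymptotic ensemble learning methods and statistical principles based on partitioning data into random sample subsets; (2) sampling methods and distributed algorithms for partitioning big data into random sample subsets; (3) a distributed algorithm framework for asymptotic ensemble learning, asymptotic random forest algorithms, and their MapReduce implementations; (4) application of asymptotic ensemble learning algorithms to classification and prediction on smart grid big data. The expected results will provide a theoretical foundation for asymptotic ensemble learning methods, an algorithm framework for research on distributed algorithms based on asymptotic ensemble learning, and highly scalable random forest analysis technology for big data classification and prediction applications, and will promote technological innovation and industrial application in China's big data field.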
Chinese keywords: big data; ensemble learning; distributed algorithms; classification algorithms; distributed data mining
English abstract: One major challenge in big data analysis is the scalability of distributed analysis algorithms. To address this challenge, this project proposes an asymptotic ensemble learning strategy that builds an ensemble learning model in steps, where each step uses only a small portion of the data to compute a subset of component models in a distributed manner. The final model is the ensemble of the subsets of component models learned in all steps. This strategy can significantly increase the capacity of distributed big data analysis on a platform with memory and computing constraints, and can scale to terabyte-scale data when learning classification models. The objectives of this project are to study new theory and methods for distributed algorithms that scale to big data, and to develop a distributed framework for implementing asymptotic ensemble learning algorithms. The research tasks are: (1) studying the asymptotic ensemble learning method and its statistical theory, based on partitioning a big data set into random sample subsets; (2) developing sampling methods and distributed algorithms for partitioning big data into random sample subsets; (3) developing a distributed framework for asymptotic ensemble learning, asymptotic distributed random forest algorithms, and MapReduce implementations; (4) applying asymptotic distributed random forest algorithms to smart grid big data for classification and prediction. The expected outcomes will establish a theoretical foundation for asymptotic ensemble learning and provide an algorithm framework for developing asymptotic distributed ensemble learning algorithms. They will also provide new scalable random forest technology for big data classification and prediction applications. The research results will promote technological innovation in the big data field and big data applications in China.
English keywords: big data; ensemble learning; distributed algorithms; classification algorithms; distributed data mining
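
The strategy described in the abstract (partition the data into random sample subsets, learn a few component models per step from a small portion of the data, and take the ensemble of all of them as the final model) can be illustrated with a minimal single-machine sketch. Everything here is an assumption made for illustration and not the project's implementation: the function names, the synthetic one-dimensional data, the use of a decision stump as a stand-in for a random forest tree, and the `blocks_per_step` and `max_steps` parameters are all hypothetical. A real system would train each block's model on a separate worker, for example as a MapReduce task.

```python
import random
from collections import Counter


def make_data(n, seed):
    """Synthetic 1-D records (x, label): label is 1 when x > 0.5, with 10% label noise."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.random()
        label = int(x > 0.5)
        if rng.random() < 0.1:
            label = 1 - label
        data.append((x, label))
    return data


def random_sample_partition(records, num_blocks, seed=0):
    """Shuffle the records and deal them into disjoint blocks, so that each
    block is a random sample of the whole data set."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    return [shuffled[i::num_blocks] for i in range(num_blocks)]


def train_stump(block):
    """Toy base learner (stand-in for a tree): choose the threshold and
    orientation that make the fewest training errors on this block."""
    best = None
    for candidate, _ in block:
        for above in (0, 1):
            errors = sum((above if x > candidate else 1 - above) != y for x, y in block)
            if best is None or errors < best[0]:
                best = (errors, candidate, above)
    _, best_threshold, best_above = best
    return lambda x: best_above if x > best_threshold else 1 - best_above


def predict(ensemble, x):
    """Majority vote over all component models."""
    return Counter(model(x) for model in ensemble).most_common(1)[0][0]


def accuracy(ensemble, data):
    return sum(predict(ensemble, x) == y for x, y in data) / len(data)


def asymptotic_ensemble(blocks, validation, blocks_per_step=4, max_steps=None):
    """Grow the ensemble step by step; each step sees only a few blocks."""
    ensemble, history = [], []
    starts = range(0, len(blocks), blocks_per_step)
    for step, start in enumerate(starts, start=1):
        if max_steps is not None and step > max_steps:
            break
        batch = blocks[start:start + blocks_per_step]
        # One component model per block; these fits are independent of each other.
        ensemble.extend(train_stump(block) for block in batch)
        history.append(accuracy(ensemble, validation))
    return ensemble, history


if __name__ == "__main__":
    blocks = random_sample_partition(make_data(2000, seed=1), num_blocks=20)
    validation = make_data(500, seed=2)
    ensemble, history = asymptotic_ensemble(blocks, validation, blocks_per_step=4)
    for step, acc in enumerate(history, start=1):
        print(f"step {step}: {len(blocks[0]) * 4 * step} records used, accuracy {acc:.3f}")
```

Because every block is a random sample of the full data set, each step needs only `blocks_per_step` blocks in memory at once, and the accuracy history shows how the ensemble behaves as more of the data is consumed.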