Research on Asymptotic Ensemble Learning Methods and Distributed Algorithms for Big Data

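Project name: Research on Asymptotic Ensemble Learning Methods and Distributed Algorithms for Big Data

Project number: No.61473194

Project type: General Program (面上项目)

Approval year: 2015

Discipline: Other

Project author: Huang Zhexue (黄哲学)

Affiliation: Shenzhen University (深圳大学)

Funding: 800,000 CNY (80万元)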

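Chinese abstract (translated): One of the challenges in research on big data analysis algorithms is the data scalability of distributed algorithms. To address this challenge, the project proposes an asymptotic ensemble learning strategy: under limited memory and computing resources, an ensemble learning model is built step by step, in batches, from portions of the data. This increases the capacity of distributed algorithms to process big data and enables them to solve classification problems on terabyte-scale data. The goal of the project is to propose new theory, methods, frameworks, and implementation techniques for highly scalable distributed algorithms for big data analysis. The main research topics are: (1) asymptotic ensemble learning methods and statistical principles based on partitioning data into random sample subsets; (2) sampling methods and distributed algorithms for partitioning big data into random sample subsets; (3) a distributed algorithm framework for asymptotic ensemble learning, an asymptotic random forest algorithm, and its MapReduce implementation; (4) application of asymptotic ensemble learning algorithms to classification and prediction on smart grid big data. The expected results will provide a theoretical foundation for asymptotic ensemble learning methods, an algorithm framework for research on distributed algorithms based on asymptotic ensemble learning, and highly scalable random forest analysis technology for big data classification and prediction applications, promoting technological innovation and industrial application in China's big data field.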

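Chinese keywords (translated): big data; ensemble learning; distributed algorithms; classification algorithms; distributed data mining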

English abstract: One major challenge in big data analysis is the scalability of distributed analysis algorithms. To address it, this project proposes an asymptotic ensemble learning strategy that builds an ensemble learning model in steps; each step uses only a small portion of the big data to compute a subset of component models in a distributed manner. The final model is the ensemble of the subsets of component models learned in all steps. This strategy can significantly increase the capacity of distributed big data analysis on a platform with memory and computing constraints, and can scale to terabyte-scale data when learning classification models. The objectives of the project are to study new theory and methods for distributed algorithms that scale to big data, and to develop a distributed framework for implementing asymptotic ensemble learning algorithms. The research tasks are: (1) studying the asymptotic ensemble learning method and its statistical theory, based on partitioning a big data set into random sample subsets; (2) developing sampling methods and distributed algorithms for partitioning big data into random sample subsets; (3) developing a distributed framework for asymptotic ensemble learning, asymptotic distributed random forest algorithms, and MapReduce implementations; (4) applying asymptotic distributed random forest algorithms to smart grid big data for classification and prediction. The expected outcomes will establish a theoretical foundation for asymptotic ensemble learning and provide an algorithm framework for developing asymptotic distributed ensemble learning algorithms. They will also provide new scalable random forest technology for big data classification and prediction applications. The research results will promote technological innovation in the big data field and big data applications in China.
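The stepwise strategy described in the abstract can be illustrated with a small single-machine sketch. It is not the project's algorithm: the function names (`random_sample_partition`, `train_stump`, `build_stepwise`), the one-feature threshold learner standing in for a random forest tree, the majority-vote combiner, and the synthetic data are all assumptions made for illustration. The sketch partitions the data into disjoint random sample blocks, then grows the ensemble a few blocks per step, so that only part of the data is touched at any time.

```python
import random
from collections import Counter


def random_sample_partition(records, num_blocks, seed=0):
    """Shuffle the records and deal them into disjoint blocks.

    After shuffling, every block is a simple random sample of the full data set.
    """
    rng = random.Random(seed)
    shuffled = list(records)
    rng.shuffle(shuffled)
    return [shuffled[i::num_blocks] for i in range(num_blocks)]


def train_stump(block):
    """Fit a one-feature threshold classifier on one block (placeholder learner)."""
    xs0 = [x for x, y in block if y == 0]
    xs1 = [x for x, y in block if y == 1]
    if not xs0 or not xs1:
        # Degenerate block with at most one class: always predict one label.
        constant = 1 if xs1 else 0
        return lambda x: constant
    mean0 = sum(xs0) / len(xs0)
    mean1 = sum(xs1) / len(xs1)
    threshold = (mean0 + mean1) / 2
    high_label = 1 if mean1 >= mean0 else 0
    return lambda x: high_label if x > threshold else 1 - high_label


def predict(ensemble, x):
    """Combine the component models by majority vote."""
    votes = Counter(model(x) for model in ensemble)
    return votes.most_common(1)[0][0]


def accuracy(ensemble, data):
    """Fraction of (x, label) pairs that the ensemble classifies correctly."""
    return sum(predict(ensemble, x) == y for x, y in data) / len(data)


def build_stepwise(blocks, validation, blocks_per_step=2, max_steps=3):
    """Grow the ensemble step by step, training on a few blocks per step."""
    ensemble, history = [], []
    for step in range(max_steps):
        batch = blocks[step * blocks_per_step:(step + 1) * blocks_per_step]
        if not batch:
            break  # no unused blocks left
        # On a cluster, each block in the batch could be handled by its own task.
        ensemble.extend(train_stump(block) for block in batch)
        history.append(accuracy(ensemble, validation))
    return ensemble, history


# Synthetic two-class data (assumed): class 0 centred at 0.0, class 1 at 3.0.
rng = random.Random(42)
records = [(rng.gauss(3.0 * label, 1.0), label) for label in (0, 1) for _ in range(500)]
validation = [(rng.gauss(3.0 * label, 1.0), label) for label in (0, 1) for _ in range(200)]

blocks = random_sample_partition(records, num_blocks=10)
ensemble, history = build_stepwise(blocks, validation)
print(len(ensemble), history)
```

With the default settings the sketch trains on 6 of the 10 blocks and records the validation accuracy after each step. In a MapReduce-style deployment, the per-block training inside each step is the part that would plausibly run as parallel tasks, with the vote combining their outputs.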

English keywords: big data; ensemble learning; distributed algorithms; classification algorithms; distributed data mining
