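Project title: Research on Ensemble Learning Models and Methods for Large-Scale Data Streams
Project number: No.71471022
Project type: General Program
Year of approval: 2015
Discipline: Management Science
Project author: Wang Yu (王昱)
Author affiliation: Chongqing University
Funding amount: CNY 630,000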
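Chinese abstract (translated): In recent years, with the widespread adoption of networked information technology across industries, the main form of data addressed by data mining has shifted from static data to large-scale data streams characterized by massive volume, dynamism, and concept drift. As a result, traditional data mining techniques struggle to learn from data and discover knowledge effectively. Targeting the characteristics of large-scale data streams in networked environments, this project conducts data mining and knowledge discovery research based on ensemble learning theory and methods, examines how the characteristics of large-scale data streams affect the efficiency and accuracy of learning, and studies how to identify common or regular information and knowledge more efficiently and accurately. The main research topics are: (1) formal representation of conceptual knowledge in large-scale data streams and concept drift detection; (2) incremental and scalable ensemble learning models and algorithms; (3) dynamically evolving ensemble learning models and algorithms for handling concept drift; (4) a dynamic customer segmentation method based on ensemble learning, together with a software system prototype. By studying ensemble learning methods and techniques for large-scale data streams from a methodological perspective, the project helps enterprises and organizations apply data mining to decision support more effectively, and has both theoretical significance and practical value.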
Chinese keywords (translated): large-scale data streams; ensemble learning
English abstract: Traditional data mining research and practice focus on batch learning, in which the entire training set is available to the data mining algorithm, which outputs a decision model after processing the data multiple times. In recent years, advances in information and network technologies have dramatically changed how data are collected and processed. Data are generated and collected at high speed, so they are large-scale, dynamic, and often subject to concept drift. In these situations, data are best modeled not as static, persistent tables but as transient data streams. Consequently, traditional data mining techniques cannot deliver effective and efficient knowledge discovery. In this research, we introduce the ensemble learning methodology to the mining of large-scale data streams. In ensemble learning, a pool of different base learners, rather than a single learner, is constructed and combined to predict the class label of an unknown instance. The main idea of ensemble learning is to exploit the strengths of the base learners while avoiding their weaknesses. Effective ensemble learning requires the base learners in the ensemble to be both accurate and diverse. Nevertheless, existing research has paid little attention to the diversity of the generated base learners, which may degrade overall learning performance. To overcome these limitations, this research project investigates the impact of the characteristics of large-scale data streams on the efficiency and accuracy of data mining, and studies how to construct ensemble learning models that extract interesting information and knowledge from data streams with higher accuracy and efficiency. The four main aspects of this research are: (1) Concept and knowledge representation and detection of concept drift in large-scale data streams. Fuzzy set theory is adopted to define the domain knowledge in the context where customers' interests and background are embedded. (2) Incremental and scalable ensemble learning models and algorithms. Sampling techniques, data reduction methods, and instance-based learning are employed to construct ensemble learning methods capable of incremental and scalable learning. (3) Dynamic and evolutionary ensemble learning models and algorithms for dealing with concept drift, which can capture the evolutionary patterns of data characteristics, parameters, and optimal ensemble learning models. (4) A dynamic customer segmentation method and a software prototype system based on ensemble learning. This research project studies fundamental issues of data mining and knowledge discovery in dynamic, large-scale data streams, which are of critical importance to business intelligence and decision support today. It is therefore meaningful and valuable in both theory and practice.
English keywords: large-scale data streams; ensemble learning