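Research on Ensemble Modeling Methods Based on High-Dimensional Large-Scale Data

Project title: Research on Ensemble Modeling Methods Based on High-Dimensional Large-Scale Data

Project number: No.61473072

Project type: General Program (面上项目)

Approval year: 2015

Discipline: Automation Technology, Computer Technology

Principal investigator: Mao Zhizhong (毛志忠)

Affiliation: Northeastern University (东北大学)

Funding: 800,000 yuan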

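Chinese abstract: Data-driven modeling is one of the most effective modeling approaches for complex industrial systems, and the data from such systems are typically high-dimensional and large in scale. Modeling with such data sets requires not only that the model reach a given accuracy but also that its computational cost be kept as low as possible, which makes research on modeling methods considerably harder. For high-dimensional, large-scale data, this project intends to propose, on the basis of the multivariate fuzzy Taylor theorem, a novel ensemble modeling method based on feature subsets (Feature Sub-Sets, FSS). The method partitions the feature space directly, independently, and exhaustively, from 1 dimension up to a preset maximum dimension, to build the FSSs; it then selects a suitable base learning algorithm and builds sub-models on all or some of the FSSs; finally, the required model is obtained by fusing the sub-models. The directness of the partition lets the FSSs be obtained simply and quickly; its exhaustiveness preserves the completeness of the information in the original data set and so ensures the accuracy of the ensemble model; its independence means the FSSs do not depend on one another, which gives the ensemble a parallel structure. Each FSS has the same number of samples as the original training set but a markedly lower input dimension, so building sub-models on these low-dimensional FSSs greatly reduces the computational cost of the model.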

Chinese keywords: high-dimensional large-scale data sets; feature subsets; ensemble model

English abstract: Building and estimating models from data sets is one of the most effective techniques for complex nonlinear systems. However, data sampled from most complex nonlinear systems are typically large-scale and high-dimensional. With such data sets it is difficult to build nonlinear models that achieve good prediction accuracy and generalization at a reasonable computational cost. To address the large-scale, high-dimensional problem, this project will propose a fundamentally different ensemble approach based on the multivariate fuzzy Taylor theorem, called Feature Subsets Ensemble (FSE). FSE constructs all feature subsets (FSSs) of the original feature space that are defined over from one up to a fixed number of input features. This set of direct, exhaustive, and independent feature-space subdivisions forms the basis of FSE. A suitable base learner is then selected to build sub-models on all or a random subset of these FSSs, and all sub-models are combined to obtain the required ensemble model. The direct division of the original feature space makes generating the FSSs simple and fast; the exhaustive division preserves the completeness of the information in the original training set and thereby ensures the predictive accuracy of FSE; the independent division keeps the FSSs separate from one another and gives FSE a parallel structure. Each FSS has the same number of samples as the original training set but a substantially lower input dimension. Because it learns on these substantially lower-dimensional data sets, FSE is characterized by low complexity.
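
The three steps described in the abstract (enumerate feature subsets, fit one sub-model per subset, combine the sub-models) might be sketched as follows. This is only an illustration under stated assumptions, not the project's method: the base learner (ordinary least squares with an intercept), the fusion rule (an unweighted average), and all function names are hypothetical choices made here, and the fuzzy Taylor construction itself is not reproduced.

```python
from itertools import combinations

import numpy as np


def feature_subsets(n_features, max_dim):
    """Enumerate every feature-index subset of size 1..max_dim."""
    return [c for k in range(1, max_dim + 1)
            for c in combinations(range(n_features), k)]


def fit_linear(X, y):
    """Assumed base learner: least squares with an intercept column."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict_linear(coef, X):
    """Evaluate a fitted least-squares sub-model on new inputs."""
    return np.column_stack([X, np.ones(len(X))]) @ coef


def fit_fse(X, y, max_dim):
    """Fit one sub-model per feature subset.

    Each sub-model sees all samples but only its own columns, and the
    fits are independent, so this loop could be run in parallel.
    """
    return [(s, fit_linear(X[:, list(s)], y))
            for s in feature_subsets(X.shape[1], max_dim)]


def predict_fse(models, X):
    """Assumed fusion rule: unweighted mean of sub-model predictions."""
    preds = [predict_linear(coef, X[:, list(s)]) for s, coef in models]
    return np.mean(preds, axis=0)
```

For d input features and maximum dimension m, the enumeration yields C(d,1) + ... + C(d,m) subsets, which grows quickly with m; that is presumably why the abstract also allows training on only a random subset of the FSSs.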

English keywords: High-dimensional and Large-scale Data Sets; Feature Subsets; Ensemble Model
