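Project title: Research on Scalable Partial Least Squares Models Based on Incremental Learning
Project number: No.61463033
Project type: Regional Science Fund Project
Year approved: 2015
Discipline: Other
Project author: Zeng Xueqiang (曾雪强)
Affiliation: Nanchang University (南昌大学)
Funding: 440,000 yuan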
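Chinese abstract: In the era of big data, the rapid growth in data volume poses a severe challenge to traditional machine learning techniques; enabling traditional machine learning models to adapt to and process massive data is one of the focal points of machine learning research in this era. Partial least squares (PLS), an effective technique for multivariate data analysis, has shown distinctive performance in several scientific and technical fields, such as web text classification and biomedical data analysis, and merits in-depth study. However, traditional PLS algorithms scale poorly and cannot meet the demands of big data. Starting from the fact that big data combines high-dimensional features with massive numbers of samples, this project studies 1) PLS algorithms capable of incremental learning, to address the efficiency of modeling massive numbers of samples; 2) the use of feature selection to improve the performance of incremental PLS algorithms on high-dimensional features; and 3) scalable incremental PLS algorithms that handle multiple dependent variables and nonlinearity, designed for complex data such as multi-response (multi-label) and nonlinear data. The new algorithms are expected to markedly improve the scalability of PLS on big data and to provide new methods for large-scale data mining of web text, biomedical data, and similar sources.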
Chinese keywords: Incremental learning;Partial least squares;Big data;Data dimensionality reduction;Feature extraction
English abstract: In the age of big data, the rapid growth of data poses a serious challenge to traditional machine learning technologies. How to make traditional machine learning models adapt to and handle vast amounts of data is one of the key problems in big data analysis. As an effective multivariate data analysis technique, the partial least squares (PLS) model is of considerable value in many application areas, especially web text classification and biomedical data analysis. However, the traditional PLS model cannot meet the requirements of big data, since its scalability is limited by high-dimensional feature spaces and massive numbers of samples. To address this problem, the project mainly studies 1) an efficient, scalable PLS model for massive data based on incremental learning, 2) improving the performance of the incremental PLS model on very high-dimensional data by introducing feature selection into the feature extraction model, and 3) improved incremental PLS models for complex data, such as data with multiple dependent variables (multi-label) and nonlinear problems. The proposed models are expected to significantly improve the scalability of PLS on big data, especially for data mining on large-scale web text and biomedical data.
English keywords: Incremental Learning;Partial Least Squares;Big Data;Dimension Reduction;Feature Extraction