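Project title: Research on varying-coefficient multicategory classification models for real-time streaming data
Project number: No. 61472475
Project type: General Program
Year of establishment/approval: 2015
Project discipline: Other
Project author: Lü Xiaoling (吕晓玲)
Author affiliation: Renmin University of China (中国人民大学)
Project funding: CNY 580,000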
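Abstract (translated from Chinese): In the current era of big data, many fields are producing a new form of data: real-time streaming data, a massive, high-dimensional, sparse, real-time, unbounded, continuous, and ordered sequence of data. Research on real-time streaming data has become a hot topic in machine learning and knowledge discovery in recent years. The varying-coefficient model from mathematical statistics is a good tool for studying this type of data: it can describe in detail the relationship between covariates and the time variable and explain how things evolve over time. However, existing results are still largely confined to small-data/small-world theory and algorithms and cannot truly meet the needs of big data analysis. Building on a review of traditional varying-coefficient model research, this project focuses on the multicategory classification problem for real-time streaming data. From the perspectives of basic model construction, model stability, online learning, and the development of distributed real-time algorithms, it studies a new theory of varying-coefficient multicategory classification models for analyzing real-time streaming data, one that integrates mathematical statistics and information science. It will also draw on two big data application cases to propose efficient operating procedures with broad prospects and value for social application.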
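Keywords (translated from Chinese): multicategory classification model; sparse model; high-dimensional data; feature selection; regularization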
English abstract: In the age of big data, a new type of data is emerging: real-time streaming data. It is a massive, high-dimensional, sparse, real-time, unbounded, continuous, ordered data sequence. The study of such data has become a hot topic in data mining and machine learning. The varying-coefficient model from mathematical statistics is a very good tool for studying such data. It can describe the relationship between covariates and time and explain how things evolve over time. Current research results, however, are limited to theories and algorithms for small data/small world and cannot meet the needs of big data. Based on traditional studies of the varying-coefficient model, this project will study multicategory classification methods for real-time streaming data. The new theoretical framework combines mathematical statistics and information science. It includes the fundamental model, stability analysis, online learning, and distributed real-time computing. Finally, we will present empirical analyses of two real-world big data sets.
English keywords: Multicategory Classification; Sparse Model; High Dimensional Data; Feature Selection; Regularization