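Project title: Research on Classification of Incompletely Labeled Data Streams Based on Co-training Strategies
Project number: No. 61273292
Project type: General Program
Year of approval: 2013
Discipline: Automation technology; computer technology
Principal investigator: Hu Xuegang (胡学钢)
Affiliation: Hefei University of Technology (合肥工业大学)
Funding: CNY 800,000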
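Abstract (translated from Chinese): In real-world data streams, class labels are often largely missing, which makes data stream classification algorithms that assume every instance is labeled difficult to apply. At the same time, the massive volume and high speed of data streams challenge traditional methods for handling incompletely labeled data. Developing effective algorithms for incompletely labeled data in data streams is therefore a key task. This project proposes to study online semi-supervised learning methods for incompletely labeled data streams, focusing on methods based on co-training strategies. First, building on a mechanism that extracts synopsis (summary) data from the data stream, we will study the adaptability theory of co-training strategies and co-training-based class label propagation mechanisms, construct a robust online semi-supervised learning model, and design evaluation criteria for the model, such as its generalization ability. Second, so that the model can adapt to continually changing data distributions, we will study methods for detecting and predicting distribution changes in incompletely labeled data streams, explore the qualitative and quantitative relationships between unlabeled instances, noise, and changes in the data distribution, and construct corresponding metrics and an evaluation framework. Building on this research, and taking the classification of online product reviews as an example, we will design and implement a prototype system for classifying incompletely labeled data streams in Web service applications.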
Keywords (translated from Chinese): Unlabeled data;Classification;Concept drift;Data stream;
English abstract: Most existing work on data stream classification assumes that all arriving data are labeled and that class labels are immediately available. In real-world applications, however, this assumption often does not hold, so learning from concept-drifting data streams with unlabeled data is a challenge. Moreover, because of the characteristics of data streams, traditional techniques for handling unlabeled and labeled data are relatively inefficient in both time and space when applied to data stream classification. It is therefore important to develop more efficient algorithms for handling data streams with unlabeled data. In this proposal, we focus on online semi-supervised learning methods for data streams with unlabeled data, especially online methods based on co-training. More specifically, we first design new summarization techniques for data streams with unlabeled data and then analyze the adaptability of the co-training technique to data streams. Correspondingly, we study label propagation methods in co-training and aim to design effective and efficient online semi-supervised learning methods and the corresponding evaluation measures. Second, we focus on detecting and predicting changes in data distributions using the above semi-super
English keywords: Unlabeled data;Classification;Concept drifting;Data stream;