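Project title: Research on Classification Methods for Multi-Label Text Data Streams
Project number: No.61503112
Project type: Young Scientists Fund
Year of approval: 2016
Discipline: Other
Project author: Li Peipei (李培培)
Author affiliation: Hefei University of Technology (合肥工业大学)
Funding: 220,000 CNY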
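Abstract (translated from Chinese): Instances in real-world data streams, especially text data streams (for example, Weibo posts and online shopping reviews), often carry multiple labels, so classification algorithms designed for single-label data streams are hard to apply directly. At the same time, the massive volume, high speed, and changing nature of data streams challenge traditional multi-label classification methods. This project therefore proposes to study online classification methods for multi-label text data streams, focusing on online multi-label classification methods based on strategies such as feature representation of the semantic contexts of entities. Building on research into entity recognition and semantic-context feature representation in text data streams, we will first study formal representations of the dependencies among labels and of the mappings between features and labels, together with online feature selection methods, and then study the construction, updating, and evaluation of online classification models for multi-label text data streams. Next, we will study methods for detecting and predicting changes in data distribution in multi-label data streams, explore the qualitative and quantitative relationships through which noise and changes in the feature-label mappings affect data distribution changes, and build corresponding metrics and an evaluation framework. Based on this research, and taking Weibo post classification as an example, we will design and implement a prototype multi-label data stream classification system for Web service applications.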
Keywords (translated from Chinese): Multi-label; Data stream; Classification; Data distribution change
English abstract: Most existing work on data stream classification is suitable only for single-label data streams. Applying it to real-world data streams is challenging, especially text data streams (such as Weibo articles and online shopping reviews) whose instances carry multiple labels. Meanwhile, traditional techniques for multi-label classification are relatively inefficient in both time and space when applied to data streams, owing to the characteristics of such streams. Therefore, in this proposal we focus on online learning methods for multi-label Web data streams, especially online methods based on feature representations of the semantic contexts of terms. More specifically, we first design new techniques for term recognition and for feature representation of semantic contexts in multi-label text data streams. We then study label dependence, the matching functions between features and labels, and formalization methods based on online feature selection. On this basis, we design effective and efficient online classification models for multi-label data streams, together with corresponding evaluation measures. Second, we focus on detecting and predicting changes in data distribution using these multi-label data stream learning models. We also analyze the qualitative and quantitative relationships between data distribution changes and both noisy data and changes in the matching functions between features and labels, and then propose corresponding evaluation measures. Lastly, we apply our methods to labeling the content of Weibo articles and design a prototype classification system for multi-label data streams.
English keywords: Multi-label; Data Stream; Classification; Concept Changing