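Research on Semi-Supervised Evolutionary Text Clustering Algorithms for Dynamic Multi-Source Text Analysis

Project name: Research on Semi-Supervised Evolutionary Text Clustering Algorithms for Dynamic Multi-Source Text Analysis

Project number: No.61462011

Project type: Regional Science Fund Project

Year of approval: 2015

Discipline: Automation technology; computer technology

Project author: Huang Ruizhang (黄瑞章)

Affiliation: Guizhou University (贵州大学)

Funding: 420,000 RMB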

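Abstract (translated from Chinese): This project develops a new semi-supervised evolutionary text clustering method that uses the Dirichlet process (DP) model together with active learning to cluster dynamic multi-source text data automatically. A new active learning method extracts supervision information that captures the current clustering result, the historical clustering results, and the characteristics of the multi-source text data, and converts it into structured supervision data that guides semi-supervised text clustering. The DP model, combined with this supervision information, partitions dynamic multi-source text data into an arbitrary number of clusters. Active learning and semi-supervised evolutionary text clustering are coupled so that each reinforces the other and the supervision information is updated effectively, gradually approaching the ideal clustering partition. The project is a breakthrough study of evolutionary text clustering algorithms and addresses two shortcomings of current algorithms: (1) a tendency to assign data to large clusters; (2) the lack of clustering analysis for multi-source data. In the application domain of text analysis, the expected results will provide solutions for practical Internet text analysis and exploratory research on the analysis of dynamic Internet data, including news and microblog data.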

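Keywords (translated from Chinese): text mining; data mining; text clustering; semi-supervised text clustering; evolutionary clustering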

Abstract (English): We aim to develop an innovative semi-supervised evolutionary document clustering approach to organize multiple correlated, time-varying document collections. The approach will be based on the Dirichlet process (DP) model and will work together with an active learning model. The active learning model collects informative supervision, which is transformed into structured constraints that aid document clustering. The current clustering partition, the clustering partitions of historical text data, and the multiple correlated document collections will all be analyzed to generate this supervised information. The semi-supervised document clustering approach, built on the DP model, will then automatically organize multiple correlated, time-varying document collections into an arbitrary number of clusters. The active learning model and the semi-supervised evolutionary document clustering approach will cooperate and reinforce each other iteratively until a satisfactory clustering result is found. This project is important for research on the semi-supervised evolutionary document clustering problem. It will address two limitations of current evolutionary document clustering: (1) the bias of the DP approach toward assigning documents to relatively large clusters; (2) the lack of research on multiple correlated document collections. In particular, existing evolutionary document clustering approaches cannot deal with multiple correlated document collections and tend to group documents into relatively large clusters. From the application point of view, this project will provide a feasible solution for analyzing real documents collected from the Internet. We will develop a news and blog article analysis system to explore applications of evolutionary document clustering.
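
As background for limitation (1), the bias toward large clusters can be seen in the Chinese restaurant process (CRP), a standard way to describe the clustering prior induced by a DP. Under the CRP, the next document joins an existing cluster with prior probability proportional to that cluster's current size, and opens a new cluster with probability proportional to a concentration parameter. The Python sketch below is illustrative only and is not the project's method: the function names `crp_assignment_probs` and `sample_crp`, the value of `alpha`, and the example cluster sizes are assumptions, and the sketch covers only the prior, with no text likelihood, supervision, or time dimension.

```python
import random


def crp_assignment_probs(cluster_sizes, alpha):
    """Prior probabilities for the next document under a CRP.

    Returns one probability per existing cluster (proportional to its
    size), followed by the probability of opening a new cluster
    (proportional to the concentration parameter alpha > 0).
    """
    total = sum(cluster_sizes) + alpha
    return [size / total for size in cluster_sizes] + [alpha / total]


def sample_crp(num_docs, alpha, seed=0):
    """Draw cluster sizes for num_docs documents from the CRP prior."""
    rng = random.Random(seed)
    sizes = []
    for _ in range(num_docs):
        probs = crp_assignment_probs(sizes, alpha)
        choice = rng.choices(range(len(probs)), weights=probs)[0]
        if choice == len(sizes):
            sizes.append(1)      # the last index means "open a new cluster"
        else:
            sizes[choice] += 1   # join an existing cluster
    return sizes


# A cluster of 8 documents is four times as likely to attract the next
# document as a cluster of 2, before any text content is considered.
probs = crp_assignment_probs([8, 2], alpha=1.0)

# Hypothetical run: cluster sizes drawn from the prior for 200 documents.
sizes = sample_crp(200, alpha=1.0)
```

Because the prior weight of a cluster grows with its size, an uncorrected DP-based clusterer has a built-in pull toward large clusters; the abstract proposes structured constraints obtained through active learning as a way to counteract it.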

Keywords (English): text mining; data mining; document clustering; semi-supervised document clustering; evolutionary document clustering
