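Project title: Research on Semi-Supervised Text Clustering Algorithms for Personalized Text Analysis
Project number: No.61202089
Project type: Young Scientists Fund
Approval year: 2013
Discipline: Computer science
Principal investigator: 黄瑞章 (Huang Ruizhang)
Affiliation: 贵州大学 (Guizhou University)
Funding: 250,000 CNY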
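Abstract (translated from Chinese): This project develops a new semi-supervised text clustering method that, combined with active learning and built on the Dirichlet process mixture model, partitions text data into clusters according to individual user preferences. A new active learning method elicits the user's clustering requirements and converts them into structured supervision that guides the semi-supervised text clustering. The Dirichlet process mixture model then partitions the text data, according to those requirements, into an arbitrary number of clusters. Active learning and semi-supervised text clustering are combined so that each reinforces the other, converging step by step on the user's ideal clustering. The project aims at a breakthrough in semi-supervised text clustering by addressing two difficulties of current algorithms: (1) they ignore individual users' intentions and so cannot organize and analyze text data in a personalized way; (2) they assume the number of clusters is a known parameter that the user supplies before the algorithm runs. In text analysis applications, the expected results will provide a solution for personalized text analysis and exploratory research toward practical personalized news data analysis.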
Keywords (translated from Chinese): text mining; data mining; text clustering; semi-supervised text clustering
English abstract: We aim to develop an innovative semi-supervised document clustering approach that organizes a document collection according to a user's individual grouping preference. The approach will be built on the Dirichlet process mixture model and will work together with an active learning model. We use the active learning model to collect the user's individual grouping preference, which is transformed into structured constraints that aid document clustering. The semi-supervised document clustering approach, built on the Dirichlet process mixture model, is then used to organize the document collection automatically according to the user's grouping preference, without needing the number of clusters in advance. The active learning model and the semi-supervised document clustering approach collaborate and reinforce each other iteratively until a satisfactory clustering result is found. This project is important for research on the semi-supervised document clustering problem. It addresses two limitations of existing semi-supervised document clustering approaches: (1) the neglect of users' grouping preferences; (2) the assumption that the number of clusters is known. In particular, existing semi-supervised document clustering approaches cannot discover user'
English keywords: text mining; data mining; document clustering; semi-supervised document clustering
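
Illustrative sketch (not part of the project record): the abstracts describe clustering with a Dirichlet process mixture model under user-derived constraints, so that the number of clusters need not be fixed in advance. The Python below is a minimal sketch of that idea only, using the Chinese restaurant process (the standard sequential view of a Dirichlet process prior) with hard must-link and cannot-link pairs. The constraint scheme, the function names `crp_weights` and `constrained_crp_partition`, and the parameter `alpha` are assumptions made for illustration and are not taken from the project. The sketch uses the prior alone and omits any document likelihood, so it does not look at text content.

```python
import random


def crp_weights(cluster_sizes, alpha):
    """Chinese restaurant process prior over cluster choices.

    An existing cluster k gets weight n_k / (n + alpha); the final entry,
    alpha / (n + alpha), is the weight of opening a new cluster.
    """
    total = sum(cluster_sizes) + alpha
    return [size / total for size in cluster_sizes] + [alpha / total]


def earlier_partners(pairs, doc):
    """Documents with a smaller index that are paired with `doc`."""
    partners = []
    for a, b in pairs:
        if doc in (a, b):
            other = b if a == doc else a
            if other < doc:
                partners.append(other)
    return partners


def constrained_crp_partition(n_docs, alpha, must_link=(), cannot_link=(), seed=0):
    """Assign documents 0..n_docs-1 to clusters one at a time.

    Hypothetical hard-constraint scheme (an assumption, not the project's
    method): a must-link pair shares a cluster; a cannot-link pair is kept
    apart. A must-link takes precedence, and cannot-link pairs are only
    checked when the document has no earlier must-link partner, so the
    constraints are assumed to be mutually consistent.
    """
    rng = random.Random(seed)
    labels = []  # labels[i] is the cluster index of document i
    sizes = []   # sizes[k] is the number of documents in cluster k

    for doc in range(n_docs):
        must = earlier_partners(must_link, doc)
        if must:
            # Follow the first already-assigned must-link partner.
            k = labels[must[0]]
        else:
            weights = crp_weights(sizes, alpha)
            # Zero out clusters that hold a cannot-link partner.
            for partner in earlier_partners(cannot_link, doc):
                weights[labels[partner]] = 0.0
            # The new-cluster weight is positive when alpha > 0,
            # so a valid choice always exists.
            k = rng.choices(range(len(weights)), weights=weights)[0]

        if k == len(sizes):
            sizes.append(0)  # open a new cluster
        sizes[k] += 1
        labels.append(k)

    return labels


if __name__ == "__main__":
    labels = constrained_crp_partition(
        n_docs=12, alpha=1.0,
        must_link=[(0, 5)], cannot_link=[(0, 1)], seed=1,
    )
    print(labels, "clusters:", len(set(labels)))
```

The number of clusters is an outcome of the sampling rather than an input: larger values of `alpha` tend to open more clusters. A fuller model would multiply each prior weight by the likelihood of the document under that cluster before sampling.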