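Project title: Research on Overlapping Text Clustering Based on Graph-Theoretic Models
Project number: No.61202312
Project type: Young Scientists Fund
Year of approval: 2013
Discipline: Computer Science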
Principal investigator: Wu Qin (吴秦)
Affiliation: Jiangnan University (江南大学)
Funding: 230,000 CNY
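Chinese abstract (translated): This project addresses two important open problems in text clustering, namely how to choose the best number of clusters and how to realize overlapping clustering, by investigating graph-theoretic models and clustering methods for overlapping text clustering. The main research topics are: (1) a graph-theoretic model for text clustering that maps the information between texts into graph space and recasts bottom-up hierarchical clustering as layer-by-layer subgraph contraction; (2) the way subgraph density varies in weighted graphs, so that choosing suitable overlapping subgraphs achieves the overlapping-clustering goal of letting a single text belong to multiple classes; (3) combinatorial optimization algorithms for the maximum cut problem in graph theory, so that choosing the best number of clusters is recast as finding a maximum cut in the cluster hierarchy graph. Building on these results and on the applicant's earlier work on text feature modeling, the structural information of a text is mapped to a feature graph to improve on traditional text features, with the final aim of determining the number of clusters automatically and performing overlapping clustering efficiently. Text clustering is widely used in information indexing, search engines, document topic identification, and related areas, and is an important research problem in information science. The research in this project has significant academic and practical value for the development of text information technology.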
Chinese keywords (translated): overlapping clustering; cluster; sampling; feature extraction; graph model
English abstract: "How to find the number of clusters" and "how to model overlapping clustering" are two important research problems in document clustering. To address them, we propose a graph model and an overlapping clustering algorithm for document categorization. Our research focuses on: (1) introducing a graph model for document clustering, mapping the information between documents into that model, and converting the hierarchical clustering of documents into the contraction of subgraphs; (2) selecting overlapping subgraphs in the graph model, so that overlapping clustering of documents is realized by finding appropriate overlapping subgraphs; (3) optimizing the maximum cut problem, using the max cut in the hierarchical clustering tree to obtain the best number of clusters. Based on our previous research results on graph models for text classification, the structural information of a text document is mapped into a signature graph. By applying the proposed clustering method to the signature graph, the number of clusters could be determined automatically and good overlapping clustering results would be achieved. Document clustering has wide applications in information retrieval, search engines, and document topic identification. It is an important research field in information scien
English keywords: overlapping clustering; cluster; sampling; feature extraction; graph model