Research on Fast Clustering and Evolution Analysis Techniques for Large-Scale Dynamic Short Texts

Project title: Research on Fast Clustering and Evolution Analysis Techniques for Large-Scale Dynamic Short Texts

Project number: No.61300114

Project type: Young Scientists Fund project

Year of establishment/approval: 2014

Discipline: Automation technology; computer technology

Project author: Liu Ming (刘铭)

Author affiliation: Harbin Institute of Technology (哈尔滨工业大学)

Funding amount: 230,000 yuan

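Abstract (translated from the Chinese): With the rapid development of the information industry, virtual communication platforms built on social networks have become an important tool for users to take part in online discussion and obtain information, and the massive volume of dynamic short texts on these platforms contains a wealth of knowledge. How to cluster these massive data, extract from them the information users care about, and track how that information evolves has therefore become an active research topic. However, the "high-dimensional sparse vector" and "semantic similarity" problems introduced by massive short-text data hinder the application of traditional clustering techniques designed for long texts. This project therefore proposes to reduce the dimensionality of the feature space through distributional word clustering and to obtain the semantic similarity between short texts through an iterative similarity computation. On this basis, the project proposes to implement fast clustering of large-scale dynamic short texts in order to capture the evolution of information and thereby reflect the overall trend of change in users' focus across different time periods, using grids to quantify the magnitude of the change and labels to reveal its content.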

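Keywords (translated from the Chinese): fast short-text clustering; information evolution analysis; semantic similarity; dynamic clustering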

English abstract: With the rapid advance of the IT industry, virtual communication platforms built on social networks have gradually become an important means for users to join online discussions and acquire knowledge. The massive dynamic short texts they contain hold a great deal of information. Thus, how to cluster these massive data, extract from them the information users care about, and understand how that information evolves has become an active research area. Unfortunately, two issues raised by large-scale short texts, "high-dimensional, sparse vectors" and "semantic similarity", prevent conventional clustering techniques designed for long texts from being applied to short texts. Therefore, this application uses distributional word clustering to reduce the dimensionality of the vector space and an iterative calculation process to obtain the semantic similarity between short texts. Building on these, this application proposes a fast, dynamic clustering algorithm for large-scale short texts, which is used to capture the evolution of information and so reflect how users' attention shifts across different time periods. Moreover, a grid structure is used to measure the magnitude of the change, and labels are extracted to show how its content changes.

English keywords: short text clustering; data evolvement analysis; semantic similarity; dynamic clustering
