Research on Repeat Pattern Recognition Algorithms for Uyghur Text Based on Web Corpora and Their Applications

Project name: Research on Repeat Pattern Recognition Algorithms for Uyghur Text Based on Web Corpora and Their Applications

Project number: No.61263044

Project type: Regional Science Fund Project

Year of approval: 2013

Discipline: Automation technology; computer technology

Project author: 木妮娜·玉素甫

Affiliation: Xinjiang Normal University (新疆师范大学)

Funding: 470,000 yuan

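Chinese abstract (translated): Research on algorithms for recognizing repeated patterns in sequences draws on many areas of computer science and has important applications in data mining, data compression, bioinformatics, and Web information extraction. Taking a Uyghur Web corpus as its object of study and building on preliminary research, this project combines algorithm design with prototype-system validation to study repeat pattern recognition algorithms for Uyghur text based on Web corpora and their applications. To improve the time efficiency of the preprocessing stage, a new algorithm will be designed that computes the suffix array and the longest common prefix array simultaneously; algorithms for fast recognition and counting of Uyghur repeat patterns in Web corpora will be studied and designed; and the application of repeat pattern feature extraction to Uyghur Web text clustering will be studied, leading to a fast repeat-pattern-based clustering algorithm for Uyghur Web text. On this basis, a multi-feature-fusion method for computing feature-term weights for online hot topic detection will be constructed, a meaningful-string-based method for discovering online hot topics will be studied, and a repeat-pattern-based semantic representation of Web information will be given a preliminary study. The work is intended to provide a theoretical basis and strong technical support for research on Uyghur Web text mining, intelligent information retrieval, and online public opinion monitoring.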

Chinese keywords (translated): repeat pattern; Uyghur Web text; cluster analysis; meaningful string; requirements analysis and modeling

English abstract: Algorithms for recognizing repeat patterns in sequences draw on many areas of computer science and have a wide range of applications in data mining, data compression, bioinformatics, and Web information extraction. Building on preliminary research, this project takes a Uyghur Web corpus as its research object and combines algorithm design with prototype system testing to design algorithms for Uyghur repeat pattern recognition and their applications. To improve the time efficiency of the preprocessing stage, we will design a new algorithm that computes the suffix array and the longest common prefix array at the same time. We will analyze the application of repeat pattern extraction to Web text clustering and design a repeat-pattern-based Uyghur Web text clustering algorithm. On this basis, we will construct a method for calculating key term weights from multiple features, study a meaningful-string-based method for extracting online hot topics, and then analyze the semantic representation of Uyghur Web information. We hope to provide a theoretical basis and strong technical support for Uyghur Web text mining, intelligent information retrieval, and online public opinion monitoring.
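The abstract names the suffix array and the longest common prefix (LCP) array as the preprocessing structures for repeat pattern recognition. The following is a minimal illustrative sketch of how those two arrays expose repeated substrings and their counts. It is not the project's algorithm: the project proposes computing both arrays simultaneously, whereas this sketch builds the suffix array by naive sorting and then derives the LCP array separately with Kasai's algorithm. The function names and the `min_len` parameter are assumptions introduced here for illustration only.

```python
from typing import Dict, List


def suffix_array(s: str) -> List[int]:
    # Naive construction (assumption: clarity over speed):
    # sort suffix start positions by the suffix text itself.
    return sorted(range(len(s)), key=lambda i: s[i:])


def lcp_array(s: str, sa: List[int]) -> List[int]:
    # Kasai's algorithm: lcp[r] is the length of the common prefix of
    # the suffixes at ranks r-1 and r in the suffix array; lcp[0] = 0.
    n = len(s)
    rank = [0] * n
    for r, i in enumerate(sa):
        rank[i] = r
    lcp = [0] * n
    h = 0
    for i in range(n):
        if rank[i] > 0:
            j = sa[rank[i] - 1]
            while i + h < n and j + h < n and s[i + h] == s[j + h]:
                h += 1
            lcp[rank[i]] = h
            if h > 0:
                h -= 1
        else:
            h = 0
    return lcp


def repeated_patterns(s: str, min_len: int = 2) -> Dict[str, int]:
    # Hypothetical helper: for every LCP value >= min_len, report the
    # shared prefix and the number of suffixes that start with it.
    sa = suffix_array(s)
    lcp = lcp_array(s, sa)
    n = len(s)
    result: Dict[str, int] = {}
    for r in range(1, n):
        length = lcp[r]
        if length < min_len:
            continue
        pattern = s[sa[r]:sa[r] + length]
        lo = r - 1  # ranks r-1 and r already share the pattern
        while lo > 0 and lcp[lo] >= length:
            lo -= 1
        hi = r
        while hi + 1 < n and lcp[hi + 1] >= length:
            hi += 1
        result[pattern] = hi - lo + 1
    return result


if __name__ == "__main__":
    print(repeated_patterns("abcabcab"))
```

Because Python strings are sequences of Unicode code points, the same functions accept Uyghur text in Arabic script without changes. The naive sort is quadratic or worse on long inputs, so a real corpus would need a faster suffix array construction.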

English keywords: repeat; Uyghur Web text; clustering analysis; meaningful string; requirements analysis and modeling
