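Project title: Research on the Construction and Application of the Co-occurrence Latent Semantic Vector Space Model and Its Semantic Kernel
Project number: No. 71503151
Project type: Young Scientists Fund
Approval year: 2016
Discipline: Management Science
Project author: 牛奉高 (Niu Fenggao)
Author affiliation: Shanxi University
Funding: 170,000 CNY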
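Chinese abstract (translated): Text data is the dominant form of data in the big data era, and mining it has become an important route to information acquisition and knowledge discovery. The vector space model (VSM) offers a very good solution for information retrieval, and as research deepened, the semantic vector space model (SVSM) and similar models appeared, improving both retrieval and text mining results. Shortcomings remain, however: the vector representation captures too little semantics, semantic extraction is too costly, or the computational complexity is high. In response, the applicant proposed a preliminary co-occurrence latent semantic vector space model (CLSVSM), which in literature clustering applications both lowered the cost of semantic extraction and produced good results. Its computational complexity is still high, however, and it is hard to generalize. The semantic kernel method can standardize the computation, reduce complexity, and be extended to areas such as text information retrieval, classification, literature aggregation, and machine learning. Building on an optimized CLSVSM, this project intends to adopt the semantic kernel idea, construct a semantic kernel for CLSVSM, and apply it to literature topic clustering to test its effectiveness.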
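Chinese keywords (translated): text mining; semantic association; knowledge discovery; information retrieval; literature aggregation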
English abstract: Text data is the main form of data in the era of big data, and text mining has become an important route to information access and knowledge discovery. The vector space model (VSM) provides a very good solution for information retrieval. As research progressed, the semantic vector space model (SVSM) and similar models emerged, improving both retrieval and text mining results. Shortcomings remain, however: the vector representation carries too little semantic information, semantic extraction is too costly, or the computational complexity is high. In view of this, the applicant proposed a preliminary co-occurrence latent semantic vector space model (CLSVSM). In literature clustering, the model not only reduces the cost of semantic extraction but also yields good results. Its computational complexity is still high, however, and the model is not conducive to wide use. The semantic kernel method can standardize the calculation process, reduce complexity, and be extended to applications such as text information retrieval, classification, literature aggregation, and machine learning. Building on an optimized CLSVSM, the project plans to adopt the semantic kernel idea, construct a semantic kernel for CLSVSM, and apply it to literature topic clustering to test its effectiveness.
English keywords: text mining; semantic association; knowledge discovery; information retrieval; literature aggregation