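Project title: Research on Class-Semantic Model Classification Methods for Text Information Security
Project number: No.61202226
Project type: Young Scientists Fund project
Approval year: 2013
Discipline: Computer science
Project author: 周晓飞 (Zhou Xiaofei)
Author affiliation: Institute of Information Engineering, Chinese Academy of Sciences (中国科学院信息工程研究所)
Funding: 220,000 yuan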
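Chinese abstract (translated): Text information security is an important problem in Internet information security research, and its core technology is text classification. Because text carries semantics, text information security urgently needs efficient text classification methods capable of semantic discovery. In current text classification research, semantic feature extraction has only used the latent semantic space to reduce the dimensionality of document feature vectors and has not made full use of the semantic features of the document classes themselves; the corresponding classification algorithms likewise do not use class-semantic information effectively. To meet the demand of text information security for high-performance text classification, this project aims to study classification methods that combine class semantics with efficient classification. The main research topics are: 1) extracting class-semantic features effectively from the samples of each class, and studying class-semantic representation models based on explicit and implicit features, so as to avoid recomputing semantic representations; 2) studying classification theory and techniques based on class-semantic representation models, and designing classifiers that account for both class semantics and the distribution of samples in the sample space while preserving the probabilistic mixture property of semantics. The research will provide effective theory, techniques, and methods for efficiently analyzing the deep security of text information, and has significant academic and scientific value.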
Chinese keywords (translated): text classification; latent semantics; classifier; feature extraction; text information security
English abstract: Text information security is one of the most important problems in web information security, and its core technique is text document categorization. Because text documents carry rich semantic information, classification methods for information security should be able to discover the latent semantics underlying a document. Currently, the latent semantic models used in document categorization only perform dimensionality reduction for classification; they do not capture class-semantic features from each class, and the corresponding classification in the semantic space depends on the represented samples without directly using class-semantic information. To meet the needs of text information security research, this project aims to develop text document classification methods that both capture class-semantic features and achieve higher classification accuracy. The project will study the following: (1) Capturing class-semantic features from each class and constructing class-semantic representation models from those features. The project considers two semantic representation models: an apparent (explicit) feature model and a latent feature model. Directly training classifiers on those representation models can avoid commo
English keywords: text classification; latent semantics; classifier; feature extraction; text information security