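Project title: Research on Sample Importance Models and Their Application in Automatic Text Classification
Project number: No.61272212
Project type: General Program
Year of approval: 2013
Project discipline: Automation technology; computer technology
Project author: Wang Mingwen (王明文)
Author affiliation: Jiangxi Normal University (江西师范大学)
Project funding: 700,000 CNY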
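Chinese abstract (translated): Automatic text classification plays an important role in effectively analyzing and exploiting Internet data, but the massive scale and high dimensionality of such data are the main difficulties it faces. A direct and effective remedy is to reduce computational complexity through sample-set reduction or dimensionality reduction while preserving the classification performance of the learning algorithm, and thereby to improve the classifier's generalization ability. Existing sample selection methods are mostly based on statistical sampling techniques and require the independent and identically distributed (i.i.d.) assumption; Boosting and maximum-margin methods implicitly embody the idea of sample selection but depend on a specific classification algorithm. Inspired by the theory of worked examples in cognitive science, this project makes no statistical assumption about the distribution of the training samples and, taking the sample's perspective, proposes a sample importance principle based on each sample's degree of contribution to classification. It plans to apply stochastic process theory and high-dimensional statistical data analysis to give a method for automatically identifying class-boundary samples in the training set, to build a sample importance model that does not depend on a specific classifier, to study algorithms for computing sample importance, and to give theoretical proofs. In combination with existing classification algorithms, it will study classification algorithms that incorporate sample weights. It will also construct the dual relationship between sample importance and feature importance and study corresponding new methods for feature selection and sample selection, offering new ideas for text classification and for classification problems in general.
Chinese keywords (translated): Automatic text classification; sample importance; feature selection; class boundary; dual relationship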
English abstract: Automated text categorization is important for analyzing and organizing Internet data effectively. Its main challenges are the massive scale and high dimensionality of the data. A direct and effective approach is to reduce computational complexity through sample reduction or dimensionality reduction, which can improve the classifier's generalization ability without loss of classification performance. Most sample selection methods are based on statistical sampling theory, which requires the samples to be independent and identically distributed (i.i.d.). Boosting and large-margin approaches implicitly embody the idea of sample selection, but they depend on specific classification algorithms. Inspired by the theory of worked examples in cognitive science, this project proposes a sample importance principle: the importance of a sample is measured by its contribution to classification, without any statistical assumption. To derive a sample importance model that does not depend on specific classifiers, we will provide approaches for automatically identifying class boundaries in the training data set using stochastic process and high-dimensional data analysis theory, design algorithms for computing sample importance, and give mathematical proofs. For example, we can exploit a random walk algorithm
English keywords: Automated text categorization; Sample importance; Feature selection; Boundary; Dual relationship