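Project title: Research on Key Technologies for Cross-Lingual Automatic Text Categorization
Project number: No.60803050
Project type: Young Scientists Fund
Year of approval: 2009
Project discipline: Metallurgy and Metalworking Technology
Project author: Dai Liuling (代六玲)
Author affiliation: Beijing Institute of Technology (北京理工大学)
Project funding: 190,000 CNY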
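Chinese abstract: Text categorization is one of the key and fundamental problems in text mining. The accelerating pace of globalization has created an urgent need for cross-lingual text categorization techniques. Although researchers have carried out a large body of work on text categorization, research that targets the cross-lingual problem remains scarce, which limits the development and application of cross-lingual text mining. This project will study in depth the key problems of cross-lingual text categorization in multilingual environments. The specific research topics are: (1) a text representation based on feature concepts, and methods for extracting those feature concepts; (2) methods for cross-lingual text similarity computation and category determination; (3) construction of a Chinese-English cross-lingual categorization test corpus, implementation of a prototype system, and evaluation and improvement of the algorithms. This research can not only overcome the difficulties of cross-lingual text categorization, but also provide effective foundational algorithms for cross-lingual information retrieval and text mining, making broader and deeper cross-lingual applications possible.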
Chinese keywords: text categorization (文本分类); cross-lingual (跨语言); text mining (文本挖掘); information retrieval (信息检索)
English abstract: Text categorization is a key and fundamental issue in text mining. The rapid progress of globalization has created an urgent demand for cross-lingual text categorization. Although researchers have done extensive work on text categorization, studies of cross-lingual text categorization are scarce, which limits the development and application of cross-lingual text mining. This project will thoroughly study the key problems of cross-lingual text categorization in multilingual settings. The main research topics are: (1) concept-based text representation and the extraction of feature concepts; (2) cross-lingual text similarity measurement and category determination; (3) construction of a Chinese-English cross-lingual corpus for categorization, implementation of a prototype system, and evaluation and improvement of the algorithms. Through this research, we can not only overcome the difficult problem of cross-lingual text categorization, but also provide fundamental algorithms for cross-lingual information retrieval and cross-lingual text mining, enabling deeper and wider cross-lingual applications.
English keywords: Text categorization; cross-lingual; text mining; information retrieval