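Project title: Research on Semantic Text Classification Technology Based on a Dynamically Circulating Tibetan Web Corpus
Project number: No.61309012
Project type: Young Scientists Fund project
Year of approval: 2014
Discipline: Automation technology, computer technology
Project author: Xu Guixian (胥桂仙)
Affiliation: Minzu University of China (中央民族大学)
Funding: 220,000 CNY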
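Chinese abstract: A semantic ontology is an explicit, formal specification of a shared conceptual model. English and Chinese semantic knowledge bases are abundant and widely applied, whereas Tibetan semantic resources are scarce, so accelerating their construction is imperative. Tibetan web resources are growing rapidly. Semantic text classification based on a dynamically circulating Tibetan web corpus can collect web data in real time, analyze and process it in real time, and deliver accurate classification results; it can also help the relevant departments quickly grasp developments on the web and guide public opinion appropriately. This project studies techniques for constructing a Tibetan classification ontology. First, information-theoretic methods are used to extract category topic words from a Tibetan classification corpus; the concept hierarchy of the classification ontology is then built from these topic words, the Hownet semantic knowledge structure, and the definitions in a Tibetan-Chinese electronic dictionary, so that the relations between concepts are described accurately. The project also studies real-time preprocessing of the circulating Tibetan web corpus to extract important information automatically, and studies ontology-based semantic space mapping, concept similarity computation, weighted semantic-network text similarity computation, and semantic classification algorithms in order to improve text classification accuracy. The project will help solve key technical problems such as building a Tibetan ontology classification system and Web semantic text classification, and will provide effective support for semantic-level research such as Tibetan information retrieval and machine translation.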
Chinese keywords: Semantic text classification; Tibetan information processing; Data mining; Tibetan web page classification; Ontology
English abstract: An ontology is an explicit formal specification of a shared conceptual model. English and Chinese semantic knowledge resources are rich and widely applied, whereas Tibetan semantic resources are scarce, so it is imperative to speed up their construction. Tibetan web pages are growing rapidly. Semantic text categorization based on a dynamic Tibetan network corpus can collect web data in real time, analyze and process web pages, and provide accurate classification results. It can also help the relevant departments quickly grasp the status of dynamic web pages and guide public opinion correctly. This project studies techniques for constructing a Tibetan classification ontology. First, the key words of the classes are extracted from a Tibetan classification corpus using information theory. Then, based on these key words, the semantic knowledge structure of Hownet, and a Tibetan-Chinese electronic dictionary, we construct the semantic concept hierarchy of the class ontology and describe the relationships between concepts accurately. We study preprocessing techniques for the dynamic Tibetan web corpus so that important information can be extracted. We focus on semantic space mapping based on the ontology, concept similarity computation and the
English keywords: Semantic text classification; Tibetan information technology; Data mining; Tibetan web page classification; Ontology