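Project title: Research on Automatic Construction and Consistency Checking Techniques for Chinese Word-Sense-Tagged Corpora
Project number: No.60873013
Project type: General Program
Year of approval: 2009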
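Project discipline: Light industry, handicrafts
Project author: Zhang Yangsen (张仰森)
Author affiliation: Beijing Information Science and Technology University
Project funding: 320,000 yuan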
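Chinese abstract: This project studies methods for automatically constructing large-scale word-sense-tagged corpora and techniques for checking the consistency of tagged corpora. In-depth research was carried out on the computability analysis and conversion of linguistic knowledge resources, the extraction and formalization of word sense disambiguation knowledge, language models and algorithms for Chinese word sense disambiguation and tagging, and consistency checking of word-sense-tagged corpora. The innovative results are: (1) a method for building a knowledge base of latent semantic collocation features by combining corpora with HowNet (《知网》); (2) a word sense disambiguation model based on multi-classifier fusion with dynamic adaptive weighting; (3) a word sense disambiguation method based on the latent maximum entropy principle; (4) a consistency checking method for word-sense-tagged corpora based on a standard pattern library, with some innovation in sentence similarity computation; (5) an experimental framework system designed for research on word sense disambiguation models; (6) a word-sense-tagged corpus of approximately 12 million Chinese characters and a prototype system for checking word sense tagging consistency. Building on the word sense disambiguation research and the large-scale corpus, further work was carried out on semantic-level text error detection, Web text classification, mining and recommendation of Web news about breaking events, and domain question answering systems. A total of 42 research papers were published (2 in international journals, 20 in domestic journals, 18 at international conferences, and 2 at domestic conferences), and 5 software copyrights were registered. The word sense disambiguation models and methods proposed in this project have important theoretical significance and practical value for the construction of Chinese corpora.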
Chinese keywords: word sense disambiguation model; multiple knowledge sources; tagged corpus; consistency checking
English abstract: In this project, we studied methods for automatically constructing large-scale word-sense-tagged corpora and techniques for checking the consistency of tagged corpora. We carried out in-depth research on the computability analysis and processing of linguistic knowledge resources, the extraction and modeling of word sense disambiguation knowledge, language models and algorithms for Chinese word sense disambiguation and tagging, and consistency checking of tagged corpora. The innovative achievements are as follows: (1) we proposed a method for constructing a knowledge base of latent semantic collocation features by combining corpora with HowNet; (2) we proposed a word sense disambiguation model based on a classifier ensemble with dynamic adaptive weighting; (3) we proposed a word sense disambiguation method based on the latent maximum entropy principle; (4) we proposed a consistency checking method for word sense tagging based on a standard pattern library; (5) we designed an experimental framework system for research on word sense disambiguation models; (6) we constructed a word-sense-tagged corpus of approximately 12 million Chinese characters and implemented a prototype system for checking word sense tagging consistency. Building on the word sense disambiguation results and the large-scale corpus, we carried out further research on semantic-level text error detection, Web text classification, mining and recommendation of Web news about breaking events, and domain question answering systems. We published 42 research papers (2 in international journals, 20 in domestic journals, 18 at international conferences, and 2 at domestic conferences) and registered 5 software copyrights. The word sense disambiguation models and methods proposed in this project have important theoretical significance and practical value for the construction of Chinese corpora.
English keywords: word sense disambiguation model; multiple knowledge sources; tagged corpus; consistency checking