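Project title: Research on Key Technologies for Tibetan Named Entity Recognition
Project number: No.61303165
Project type: Young Scientists Fund project
Year of approval: 2014
Discipline: Automation technology; computer technology
Project author: 诺明花
Author affiliation: Inner Mongolia University (内蒙古大学)
Project funding: 220,000 yuan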
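Abstract (translated from Chinese): Named entity recognition is a prerequisite technology for, and a hard problem in, natural language processing tasks such as information extraction, syntactic parsing, and cross-language retrieval. Natural language processing research on Tibetan started relatively late and its foundations remain weak; high-precision automatic recognition of Tibetan named entities has not yet been fully solved. Taking Tibetan named entities as its object of study, this project analyzes the internal and external features of Tibetan person names, place names, and organization names, combines the strengths of rule-based and statistical methods, and proposes a fast, efficient, and accurate recognition framework suited to Tibetan. First, machine learning algorithms are used to build an organization-name recognition knowledge base from a large-scale Tibetan corpus and a Chinese-Tibetan transliteration correspondence statistics base from a Chinese-Tibetan aligned corpus, improving the precision of Tibetan named entity recognition. Second, the project studies a Tibetan named entity recognition method based on a hierarchical machine learning model that handles simple and complex named entities within a unified recognition framework, and studies parameter learning methods for the multiple sub-models. The project will produce a Tibetan organization-name recognition knowledge base, a Chinese-Tibetan transliteration correspondence statistics base, and a Tibetan named-entity annotated corpus, providing a foundation for research on Tibetan natural language processing.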
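Keywords (translated from Chinese): Tibetan script; Tibetan named entity recognition; Tibetan out-of-vocabulary word recognition; Tibetan base noun phrase recognition; conditional random field model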
English abstract: Named entity (NE) recognition plays an important role in natural language processing tasks such as information extraction, syntactic analysis, and cross-language retrieval. However, high-precision Tibetan NE recognition remains an unresolved problem because of inadequate data resources and the limitations of existing recognition algorithms. We propose a fast, efficient, and more precise Tibetan NE recognition framework based on an analysis of the internal and external features of Tibetan person names, location names, and organization names. The framework combines the advantages of rule-based and statistical recognition methods. First, using machine learning algorithms, we build an organization name knowledge base from a large-scale Tibetan corpus and a Chinese-Tibetan transliteration correspondence knowledge base from a Chinese-Tibetan aligned corpus. These two knowledge bases help improve the accuracy of Tibetan NE recognition. Second, we adopt a hierarchical Tibetan named entity recognition method and integrate simple and complex named entities into a unified framework. We also study parameter learning methods for the multiple models within our recognition framework. Eventually, Tibetan organization name knowledge base, Chinese-Tibetan transliteration correspondence knowledge base and Tibetan na
English keywords: Tibetan; Tibetan named entity recognition; identification of Tibetan out-of-vocabulary words; identification of Tibetan BaseNP; CRF