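Project title: Research on a Joint Method for Unsupervised Word Segmentation and Part-of-Speech Induction
Project number: No. 61303105
Project type: Young Scientists Fund project
Year of approval: 2014
Discipline: Automation technology; computer technology
Project author: Wang Hanshi (王函石)
Author affiliation: Capital Normal University (首都师范大学)
Funding amount: 250,000 yuan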
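Chinese abstract (translated): Unsupervised word segmentation and part-of-speech induction, as successive tasks, are important research topics in computational linguistics, with considerable theoretical value and broad application prospects. This project proposes a joint method that combines unsupervised word segmentation with part-of-speech induction, so that the statistical information at these two levels complements each other, with the aim of improving the performance of both kinds of processing in natural language understanding. The joint method builds on the unsupervised word segmentation method and the context cohesion idea previously proposed by the applicant. On the one hand, it obtains language-independent morphological information based on morphemes and their classes in order to further improve processing accuracy. On the other hand, it obtains induction results in which one word may belong to several classes, and uses global statistical features to distinguish closed word classes from open word classes, so as to produce results that are close to manual standards and easy for humans to understand, improving performance by raising evaluation scores. The results of this research will lay a foundation for building larger-scale unsupervised joint methods that also cover grammar induction.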
Chinese keywords (translated): natural language understanding
English abstract: Unsupervised word segmentation and part-of-speech induction are two important tasks in computational linguistics. In this project, we propose an unsupervised joint approach to word segmentation and part-of-speech induction. In this approach, the segmentation of morphemes and words is based on the unsupervised word segmentation approach that we proposed earlier, and the induction of morpheme classes and word classes is based on the context cohesion mechanism that we also proposed earlier. As an unsupervised approach, it can process data without any hand-built lexicons, manually annotated corpora, or language-specific prior knowledge. As a joint approach, it can use the structural information of word class sequences to improve the quality of unsupervised word segmentation, and the improved segmentation in turn improves the performance of word class induction. In addition to these advantages, the morphological information derived from morpheme classes and word structures can further improve the performance of both unsupervised word segmentation and word class induction. The approach can also produce induction results that are, to some extent, understandable by humans, by exploiting the different statistical features of open-class and closed-class words. In the future, the approach will
English keywords: natural language understanding