Chinese Natural Chunk Analysis Based on Natural Annotation Information in Massive Corpora

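Project title: Chinese Natural Chunk Analysis Based on Natural Annotation Information in Massive Corpora

Project number: No. 61300081

Project type: Young Scientists Fund

Year approved: 2014

Discipline: Automation technology; computer technology

Project author: Yu Dong (于东)

Affiliation: Beijing Language and Culture University (北京语言大学)

Funding: 230,000 yuan (RMB)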

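Abstract (Chinese): Chunk analysis divides complex sentences into finer-grained segments, which can effectively reduce the complexity of information processing. This project takes natural annotation information in massive Chinese corpora, such as punctuation marks, boundary markers, and function words, as the knowledge source for chunk segmentation, and proposes the concept of the natural chunk: a language segment that occurs stably and frequently in massive corpora and has clear boundary properties. Natural chunks are not constrained by grammatical rules, which gives them an advantage in handling Chinese boundary segmentation problems. The project studies unsupervised methods for recognizing Chinese natural chunks: mining language boundary knowledge from natural annotation information in massive corpora; studying statistics-based methods for extracting chunk boundary features in order to model that boundary knowledge; and recasting natural chunk analysis as a state-space search problem, studying search-space pruning and fast decoding algorithms. Natural chunk boundaries are flexible: the appropriate segmentation differs from one application to another. The project therefore studies methods for evaluating the plausibility of natural chunks from different perspectives, as well as chunk granularity control and parameter training methods. It analyzes the coverage of natural chunks over a massive dictionary to examine their ability to describe Chinese lexical knowledge, and analyzes, from the perspective of isomorphism, the consistency of natural chunks with Chinese word segmentation and Chinese prosodic phrases, in order to evaluate the performance of natural chunk analysis.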
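The pipeline outlined in the abstract (mine boundary knowledge from punctuation, model it statistically, then decode by search) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and not the project's actual method: the punctuation set `PUNCT`, the character-level start/end statistics, the geometric-mean gap score, the `max_len` pruning rule, the `cut_bonus` granularity parameter, and the toy corpus are all hypothetical.

```python
import math
import re
from collections import Counter

PUNCT = "，。！？；：、"  # assumed set of boundary-marking punctuation


def mine_boundary_stats(corpus):
    """Count how often each character occurs, starts, or ends a
    punctuation-delimited segment (the 'natural annotation')."""
    total, starts, ends = Counter(), Counter(), Counter()
    for line in corpus:
        for seg in re.split("[" + PUNCT + "]", line):
            if not seg:
                continue
            total.update(seg)       # per-character occurrence counts
            starts[seg[0]] += 1
            ends[seg[-1]] += 1
    return total, starts, ends


def gap_probability(left, right, stats, alpha=0.5):
    """Smoothed estimate that a boundary falls between `left` and `right`."""
    total, starts, ends = stats
    p_end = (ends[left] + alpha) / (total[left] + 2 * alpha)
    p_start = (starts[right] + alpha) / (total[right] + 2 * alpha)
    return math.sqrt(p_end * p_start)  # geometric mean of the two cues


def chunk(sentence, stats, max_len=6, cut_bonus=0.0):
    """Dynamic-programming search over cut positions.
    max_len prunes the search space; a larger cut_bonus gives finer chunks."""
    n = len(sentence)
    if n == 0:
        return []
    cut = [0.0] * (n + 1)   # log-score for cutting at gap k (end of sentence is free)
    keep = [0.0] * (n + 1)  # log-score for not cutting at gap k
    for k in range(1, n):
        p = gap_probability(sentence[k - 1], sentence[k], stats)
        cut[k], keep[k] = math.log(p), math.log(1.0 - p)
    best = [-math.inf] * (n + 1)
    back = [0] * (n + 1)
    best[0] = 0.0
    for j in range(1, n + 1):
        for i in range(max(0, j - max_len), j):  # only chunks of length <= max_len
            score = best[i] + sum(keep[i + 1:j]) + cut[j] + cut_bonus
            if score > best[j]:
                best[j], back[j] = score, i
    chunks, j = [], n
    while j > 0:                                  # follow back-pointers
        chunks.append(sentence[back[j]:j])
        j = back[j]
    return chunks[::-1]


# Toy corpus, invented for illustration only.
corpus = ["我们研究语言，我们分析语块。", "语言很复杂，语块很稳定。", "我们研究语块，语块很有用。"]
stats = mine_boundary_stats(corpus)
print(chunk("我们研究语言我们分析语块", stats))                 # coarse granularity
print(chunk("我们研究语言我们分析语块", stats, cut_bonus=1.0))  # finer granularity
```

On this toy corpus the first call yields two chunks and the second yields four, which mirrors the abstract's point that boundaries are flexible and granularity is a tunable parameter. A real system would need far richer boundary features and a corpus many orders of magnitude larger.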

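Keywords (Chinese): natural annotation information; natural chunk analysis; language boundary; unsupervised word segmentation; word embeddings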

Abstract (English): Chunking segments complex sentences into fine-grained segments in order to reduce the complexity of language processing. In this project, we propose Natural Boundary Chunking, which identifies segments that appear stably and frequently in massive-scale corpora and have distinctive boundary features. Natural Boundary Chunking is not bound by syntactic rules, which gives it an advantage in Chinese boundary demarcation. We will study unsupervised methods for recognizing Natural Boundary Chunks: mining boundary information from Natural Annotations in massive-scale corpora, statistical boundary feature extraction, and boundary information modeling. The chunking problem is recast as a state-space search problem, and pruning and fast decoding algorithms for it are also part of our research. Natural Boundary Chunking is flexible: depending on application requirements, different reasonable chunkings of the same sentence can be produced. We will study methods for evaluating the rationality of Natural Boundary Chunking. Its ability to describe the lexicon will be evaluated by recall on a massive-scale dictionary, and its isomorphism with Chinese prosodic phrases and word segmentation will be studied to evaluate its analytical ability.
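The two evaluation ideas named in the abstract, dictionary recall and consistency with word segmentation, could be measured roughly as follows. The abstract does not define either metric, so both formulas and the function names `boundary_consistency` and `dictionary_recall` are assumptions chosen for illustration.

```python
def boundaries(segments):
    """Return the set of internal boundary offsets implied by a segmentation."""
    offsets, pos = set(), 0
    for seg in segments[:-1]:   # the final offset is the sentence end, not a cut
        pos += len(seg)
        offsets.add(pos)
    return offsets


def boundary_consistency(chunks, words):
    """Fraction of chunk boundaries that coincide with word boundaries.
    A value of 1.0 means no chunk boundary splits a reference word."""
    if "".join(chunks) != "".join(words):
        raise ValueError("segmentations must cover the same sentence")
    chunk_cuts, word_cuts = boundaries(chunks), boundaries(words)
    if not chunk_cuts:
        return 1.0              # a single chunk cannot split any word
    return len(chunk_cuts & word_cuts) / len(chunk_cuts)


def dictionary_recall(chunk_inventory, dictionary):
    """Fraction of dictionary entries that occur as chunks."""
    entries = set(dictionary)
    if not entries:
        return 0.0
    return len(entries & set(chunk_inventory)) / len(entries)


# Hypothetical chunking and reference word segmentation of one sentence.
chunks = ["我们研究", "语言", "我们分析", "语块"]
words = ["我们", "研究", "语言", "我们", "分析", "语块"]
print(boundary_consistency(chunks, words))
print(dictionary_recall(chunks, ["语言", "语块", "研究", "分析"]))
```

Measuring consistency from the chunk side only is a deliberate choice here: chunks are allowed to be coarser than words, so a chunk that merges several words is not penalized, while a chunk boundary that falls inside a word is.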

Keywords (English): Natural Annotation; Natural Boundary Chunks; Language Boundary; Unsupervised Chinese Word Segmentation; Word Embeddings
