Research on Automatic Adaptation of Corpus Annotation Standards

Project title: Research on Automatic Adaptation of Corpus Annotation Standards

Project number: No.61202216

Project type: Young Scientists Fund

Year of approval: 2013

Discipline: Computer Science

Project author: Jiang Wenbin (姜文斌)

Affiliation: Institute of Computing Technology, Chinese Academy of Sciences

Funding: 230,000 yuan

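Abstract (translated from Chinese): Human-annotated corpora are the main source of knowledge for statistical modeling in natural language processing. Building a corpus usually demands a great deal of labor from linguists, which makes it expensive and time-consuming. For many languages, however, there is a serious waste of resources: multiple human-annotated corpora with different annotation standards exist for the same natural language processing task. An automatic integration or transformation algorithm that can both merge the knowledge in corpora with different annotation standards and convert a corpus from one annotation standard to another is therefore of great theoretical and practical significance. This problem can be formalized as annotation standard adaptation. This proposal puts forward an efficient and general adaptation strategy, used either to merge knowledge across different annotation standards (standard integration) or to convert knowledge under one annotation standard into another (standard transformation). We design discriminative statistical models that automatically learn the regularities of integration and transformation between different annotation standards. This work can combine different corpora to build more accurate natural language processing analyzers, and it can offer statistical insights for linguistic analysis and corpus construction, ultimately helping to advance statistical natural language processing as a whole and to better serve society.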

Keywords (translated from Chinese): annotation standard; transfer learning; annotated corpus

English abstract: Human-annotated corpora are the main source of knowledge for statistical NLP modeling. Building a corpus usually requires a great deal of work from linguists and is therefore expensive. For many languages, however, there often exist multiple corpora for the same task with vastly different and incompatible annotation philosophies, which is a great waste of human effort. It is valuable, both theoretically and practically, to develop automatic adaptation or transformation strategies that integrate knowledge in corpora with different annotation standards or transform a corpus from one annotation standard to another. These problems can be formalized as automatic adaptation of annotation standards. In this proposal we propose automatic, effective and universal statistical strategies to adapt different annotation standards, in order to integrate knowledge in corpora with different annotation standards (denoted as annotation integration) or to transform a corpus from one annotation standard to another (denoted as annotation transformation). We design discriminative models to automatically learn the statistical regularities for transforming the annotation standard of the source corpus into the standard of the target corpus. This work can finally build more powerful NLP tools with resources under different annotation stan

English keywords: annotation guideline; transfer learning; annotated corpus
