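Project title: Research on Discourse Structure Analysis and Discourse Annotation Based on Bilingual Projection
Project number: No.61202244
Project type: Young Scientists Fund
Year of approval: 2013
Discipline: Computer Science
Project author: Jian Ping (鉴萍)
Author affiliation: Beijing Institute of Technology
Funding: 230,000 CNY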
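Chinese abstract: After decades of development, natural language processing has moved its objects of analysis from characters, words, and phrases to sentences, and has naturally and inevitably reached the level of discourse. Today, with statistical natural language processing and corpus linguistics prevailing, and following the release of the Penn Discourse TreeBank, researchers have begun to use various machine learning methods to explain discourse structure by annotating discourse relations, setting off a wave of interest in discourse structure analysis. However, because discourse problems are complex, the accuracy of implicit relation recognition, the core of discourse relation analysis, has not exceeded 50%. This is the best evidence that discourse analysis is still in its early stages. This project first targets this difficult problem. For Chinese, the biggest problem at present is the lack of a large-scale discourse corpus, which severely constrains research on and applications of Chinese discourse. Annotating a discourse corpus is also undoubtedly a difficult, time-consuming, and labor-intensive undertaking. In this project, we hope to use Chinese-English parallel treebanks as a resource and obtain Chinese discourse relation labels by analyzing discourse on the English side. Whether the resulting Chinese discourse data is used as a seed corpus or regarded as a framework for discourse annotation, it will offer a convenient path toward building large-scale discourse corpora for Chinese (and even other languages) in the future.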
Chinese keywords: discourse analysis; discourse annotation; bilingual projection; machine translation;
English abstract: After decades of development, natural language processing has moved from characters and words to phrases and sentences, and is now turning to discourse. In the age of statistical natural language processing and corpus linguistics, researchers have begun to interpret discourse structure by labeling the relations between discourse units with machine learning methods, an effort made possible by the release of a large-scale corpus, the Penn Discourse TreeBank (PDTB). However, performance has not kept pace with interest in the task. Discourse is so intricate that the accuracy of implicit relation recognition, a core problem of discourse parsing, has so far remained below 50 percent. This is why discourse analysis is said to be "in its infancy". This project first targets this problem. Chinese is less fortunate than English: there is no annotated discourse corpus of adequate size, which makes the relevant research difficult or even impossible. One reason is that annotating discourse resources is complicated and time-consuming. In this project, we explore a shortcut for corpus building: annotating a Chinese discourse corpus by language projection. No matter regarding the corpus we promise to construc
English keywords: discourse analysis; discourse annotation; bitext projection; machine translation;