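Project title: Research on Key Discourse Analysis Techniques for Question-Focused Automatic Summarization
Project number: No. 60875042
Project type: General Program
Year of approval: 2009
Discipline: Architectural Science
Project author: Li Sujian (李素建)
Affiliation: Peking University
Funding: 280,000 yuan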
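Chinese abstract (translated): Question-focused automatic summarization mainly relies on sentence extraction based on simple features, which leaves summaries with highly redundant sentences, incoherent topics, and poor answers to the question. To improve summarization performance, this proposal introduces the theory and techniques of discourse analysis and pursues three lines of research. First, starting from a linguistic perspective while taking the computability of discourse analysis into account, it integrates Rhetorical Structure Theory (RST) and Centering Theory (CT), defines a discourse annotation scheme that combines the rhetorical relations and topic-shift relations between sentences, and proposes an automatic discourse labeling method based on conditional random fields. Second, so that topics are distributed coherently and evenly across the summary, it uses discourse relations to build a topic model: a two-level topic structure is established at the sentence level and the lexical level, a hierarchical probabilistic generative model of topics is constructed at the lexical level, and the topic segmentation is adjusted through the discourse relations between sentences. Third, because question answering (QA) is integrated into the summarization task, the project proposes methods for identifying and analyzing complex opinion questions, as well as a method for analyzing the opinion orientation of sentences in the text. The results of this research not only offer a new approach to improving automatic summarization systems but also provide basic resources and techniques for discourse analysis and related work.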
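Chinese keywords (translated): question-focused automatic summarization; discourse analysis; topic analysis; opinion orientation analysis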
English abstract: Most question-focused automatic summarization systems adopt extractive methods based on simple features. As a result, they suffer from problems such as information redundancy, incoherence between topics, and an inability to answer questions well. To overcome these shortcomings, we propose introducing the theories and techniques of text analysis into summarization systems. First, from a linguistic point of view and with the computability of text analysis in mind, Rhetorical Structure Theory (RST) and Centering Theory (CT) are combined to define the discourse annotation scheme and to label the rhetorical relations and topic transition relations between sentences. Second, so that topics are distributed coherently and in balanced proportion across the summary, discourse relations are used to construct a two-level topic model covering the sentence level and the lexical level. A hierarchical generative model of topics is constructed at the lexical level, and the segmentation of topics is then adjusted through discourse relations. Finally, because question answering is integrated into the summarization task, complex opinion questions are identified and analyzed, as is the opinion orientation of the sentences in the text. This research will not only contribute to improving summarization systems but also provide fundamental resources and techniques for research on text analysis.
English keywords: Question-Focused Automatic Summarization; Text Analysis; Topic Analysis; Opinion Analysis