Project title: Research on Information Extraction Techniques in Question-Answering Information Retrieval
Project number: No.60803086
Project type: Young Scientists Fund project
Year of approval: 2009
Discipline: Metallurgy and Metalworking
Project author: Du Yongping (杜永萍)
Author affiliation: Beijing University of Technology (北京工业大学)
Funding amount: 180,000 yuan
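Abstract (translated from Chinese): Question-answering information retrieval is a new generation of search engine. It accepts a question posed in natural language as the query and returns answers extracted from the document collection as the search result. It is closer to users' needs and is a research area with broad application prospects. This project studies the core technology of question-answering information retrieval, namely intelligent information extraction. This includes building a knowledge source through pattern learning and pattern optimization; mining semantic associations and building an entailment recognition model based on machine learning methods; and performing association analysis based on dependency syntactic structure. Finally, the different strategies are applied to Web question-answering retrieval (question answering over massive information) and to the reading comprehension task (single-document question answering) to extract answer information and test their effectiveness. The project built a pattern knowledge base of a certain scale, containing 180 distinct question types and 4261 answer patterns. In the study of semantic entailment recognition, the AdaBoost and SVM classifiers both achieved good performance on the TAC data set, with precision above 60%. Features based on semantic chains worked well, and a t-test showed that system performance improved significantly (p<0.05). The implementation of this project helps promote the development of next-generation search.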
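Keywords (translated from Chinese): Question Answering; Reading Comprehension; Information Extraction; Natural Language Processing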
Abstract (English): Open-domain question answering (QA) is an advanced application of natural language processing. The goal of QA is to retrieve answers to natural language questions rather than returning documents, as most information retrieval systems currently do. This project studies intelligent information extraction, the core technology of question answering. A pattern knowledge resource was constructed through pattern learning and optimization. Mining semantic relations is important in QA, and an entailment model based on machine learning was studied. In addition, relation analysis based on syntactic structure also helps answer information extraction. Finally, the different techniques are applied to both multi-document question answering and single-document reading comprehension for information extraction. The pattern knowledge resource contains about 180 question types and 4261 answer patterns. In semantic entailment recognition, the AdaBoost and SVM classifiers achieve good performance on the TAC evaluation data set, with precision above 60%. A t-test shows that the lexical chain feature significantly improves system performance (p<0.05). The implementation of the project will help promote the development of information retrieval technology.
Keywords (English): Question Answering; Reading Comprehension; Information Extraction; Natural Language Processing