Research on Cross-Domain Information Extraction Methods and Their Application to Digital Books

Project name: Research on Cross-Domain Information Extraction Methods and Their Application to Digital Books

Project number: No. 61303097

Project type: Young Scientists Fund project

Year of approval: 2014

Discipline: Automation technology; computer technology

Project author: Zhu Wenhao (朱文浩)

Affiliation: Shanghai University

Funding: 230,000 yuan

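Abstract (translated from Chinese): Against the backdrop of the information explosion, using information extraction to derive structured information from semi-structured and unstructured text has become one of the key supporting technologies in many fields, including databases, the Internet, knowledge discovery, the semantic web, and information retrieval. According to the state of research in China and abroad, one of the core problems in information extraction is how to design extraction methods that adapt quickly to changes in knowledge topic and text form, thereby achieving cross-domain extraction. This project addresses that problem by decomposing domain-relevant features and building a multi-level feature model to realize a cross-domain information extraction method. Based on this method, it will implement an information extraction system that iteratively optimizes feature selection and combination through feedback. On this basis, the project will also take digital books as the application target, combining book metadata with the multi-level feature model and using domain similarity calculation to obtain more representative samples and improve extraction results. The research is expected to achieve a breakthrough in the theory of feature models for information extraction, and to deliver an information extraction method that adapts quickly to domain changes together with an effective information extraction system. In addition, because the application targets digital books, the work will promote the development of knowledge services based on digital books. It is therefore significant in both theory and application.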

Keywords (translated from Chinese): information extraction; cross-domain; feature model

English abstract: With the information explosion, information extraction, which derives structured information from semi-structured and unstructured text, is becoming a key supporting technology in many research areas, including databases, the World Wide Web, knowledge discovery, the semantic web, and information retrieval. An analysis of current research shows that one of the key problems in information extraction is designing a cross-domain extraction procedure that can adapt to different domain topics and text formats. This project addresses the problem by decomposing domain-relevant features and establishing an extraction method based on a multi-level feature model. Based on the feature model, the project will also implement a system with a feedback mechanism that optimizes feature selection and combination. Additionally, using book metadata and the feature model, the research aims to improve extraction performance by locating more representative samples through domain similarity calculation with respect to digital books. The research is expected to form a theory of feature modeling for cross-domain information extraction, and to implement an information extraction method that can adapt to domain variations and an effective e

English keywords: information extraction; cross-domain; feature model
