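Project title: Research on Semi-Supervised Adaptive Learning for Multi-Granularity Relation Extraction from Text
Project number: No.61202135
Project type: Young Scientists Fund
Year of approval: 2013
Discipline: Computer Science
Project author: Chen Yifei (陈一飞)
Affiliation: Nanjing Audit University (南京审计学院)
Funding: 240,000 CNY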
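Abstract (translated from Chinese): Automatically extracting semantic relations from text is an important research topic in text mining and machine learning. This project aims to build a semi-supervised adaptive learning method for multi-granularity relation extraction that, given a small number of labeled samples and a large number of unlabeled samples, automatically extracts complex multi-class relations at different levels, and to apply this method to protein-protein interaction extraction from biological texts. The main research topics are: (1) constructing an improved heuristic fast semi-supervised support vector machine learning method, contributing new research to efficient, scalable multi-class semi-supervised learning; (2) proposing an adaptive classification model that uses active learning to optimize semi-supervised learning, further improving its performance and efficiency; (3) studying in depth the extraction of global and local features that describe relations in complex text, fusing multi-granularity prior knowledge, and proposing a unified multi-granularity learning framework, which can also be applied to other domains involving large numbers of unlabeled samples and multi-granularity information extraction; (4) applying the theoretical model to text mining research on protein-protein interaction extraction and building an automatic extraction system for multi-granularity, multi-class relations, which offers a new approach to biological problems and has high theoretical and practical value.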
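Keywords (translated from Chinese): text mining; multi-granularity relation extraction; semi-supervised learning; adaptive learning; support vector machines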
English abstract: Automatic extraction of semantic relations from text is an important research topic in text mining and machine learning. This project aims to establish a new semi-supervised adaptive learning framework for multi-granularity relationship extraction and to apply it to protein-protein interaction extraction from biomedical literature. The project's main research topics are: (1) to propose a theoretical framework for improved heuristic fast semi-supervised support vector machines, which contributes to efficient, large-scale semi-supervised learning; (2) to build a new multi-granularity adaptive classification model that integrates active learning with semi-supervised learning, and to propose a new theoretical framework for adaptive learning; (3) to establish a multi-granularity multi-classifier for the relation extraction task, which can also be applied to other areas with large numbers of unlabeled samples and high-dimensional feature vectors; (4) to apply the proposed theoretical model to text mining for protein-protein interaction extraction. By integrating semi-supervised learning with active learning and extracting rich, multi-granularity features based on natural language structure and biological domain information, a new machine learning f
English keywords: text mining; multi-granularity relationship extraction; semi-supervised learning; adaptive learning; support vector machines