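Project title: Research on Chinese Hedge Detection Based on Translated Learning and Kernel Methods
Project number: No.61272375
Project type: General Program
Approval year: 2013
Discipline: Automation technology, computer technology
Principal investigator: Zhou Huiwei (周惠巍)
Affiliation: Dalian University of Technology (大连理工大学)
Funding: 800,000 yuan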
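Chinese abstract (translated): As an important step in information extraction, hedge detection aims to distinguish uncertain information from factual information, so that hedged information is not extracted as fact. In recent years, hedge detection for English has produced initial research results. Hedges are also widely used in Chinese texts across many domains, so research on Chinese hedge detection is important for Chinese factual information extraction. This project first targets biomedical literature: based on English annotated data, it uses translated learning to train a Chinese hedged-sentence identification model, achieving cross-language learning. It then uses transfer learning to carry the hedged-sentence identification knowledge learned from Chinese biomedical literature over to other domains, achieving cross-domain hedged-sentence identification. It also designs and builds a Chinese hedge corpus. Finally, it extracts flat features together with structured syntactic and semantic features, and uses a composite kernel that combines a polynomial kernel and a convolution tree kernel to build a hedge scope detection model. The research on cross-language and cross-domain hedged-sentence identification will provide a theoretical foundation and methodological support for knowledge transfer and generalization in natural language processing; research on Chinese hedge detection will improve the factuality and accuracy of Chinese information extraction.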
Chinese keywords (translated): Chinese hedge detection; transfer learning; deep learning; kernel methods; representation learning
English abstract: As an important step in information extraction, hedge detection distinguishes factual from uncertain information so that speculative information is not extracted as fact. In recent years, extensive research has been done on automatic hedge detection in English texts. Hedges are also widely used in Chinese texts across various fields, so research on hedge detection in Chinese texts is important for Chinese information extraction. In this work, translated learning methods are used for cross-language learning, identifying Chinese hedged sentences from English training data in the biomedical domain. Transfer learning methods are then used to transfer the knowledge learned in the Chinese biomedical domain to other domains, addressing cross-domain hedged-sentence identification. Chinese hedge corpora are designed and constructed. Flat features and structured features carrying syntactic and semantic information are extracted to train a hedge scope detection model with a composite kernel that combines a polynomial kernel and a convolution tree kernel. In sum, the research on cross-language and cross-domain hedged-sentence identification will provide both a theoretical foundation and specific methods for knowledge transfer and generalization in natural language processing.
English keywords: Chinese hedge detection; transfer learning; deep learning; kernel methods; representation learning